Abstract

In multiple-input multiple-output (MIMO) fading channels maximum likelihood (ML) detection 
is desirable to achieve high performance, but its complexity grows exponentially with the spectral 
efficiency. The current state of the art in MIMO detection is list decoding and lattice decoding. This 
paper proposes a new class of lattice detectors that combines some of the principles of both list and 
lattice decoding, thus resulting in an efficient parallelizable implementation and near-optimal soft-output
ML performance. The novel detector is called the layered orthogonal lattice detector (LORD), because it
adopts a new lattice formulation and relies on a channel orthogonalization process. The
algorithm achieves optimal hard-output ML performance in the case of two transmit antennas. For two
transmit antennas, max-log bit soft-output information can be generated, and for more than two antennas
approximate max-log detection is achieved. Simulation results show that LORD, in MIMO systems
employing orthogonal frequency division multiplexing (OFDM) and bit interleaved coded modulation
(BICM), achieves very high signal-to-noise ratio (SNR) gains compared to practical soft-output
detectors such as minimum mean square error (MMSE), in either linear or nonlinear iterative schemes.
In addition, the performance comparison with hard-output decoded algebraic space-time codes shows the
fundamental importance of soft-output generation capability for practical wireless applications.

I. Introduction

Wireless transmission through multiple antennas, also referred to as multiple-input multiple-output
(MIMO) radio, is currently enjoying great popularity as it is considered the technology
able to satisfy the ever-increasing demand for high data rate communications. In MIMO fading
channels and in the presence of additive white Gaussian noise (AWGN), maximum-likelihood (ML)
detection is optimal [1]. A straightforward implementation of the ML detector would require,
for an uncoded complex constellation of size $S$ and $L_t$ transmit antennas, an exhaustive search
over all $S^{L_t}$ possible transmit sequences, and is thus prohibitively complex for high spectral
efficiencies. This observation justifies the intense interest in reduced-complexity, sub-optimal
linear detectors like zero-forcing (ZF) or minimum mean square error (MMSE) [2]. These
algorithms currently represent the practical state of the art for MIMO coded systems, as they can
easily generate bit soft-output information for use with powerful coded modulations. It should
be noted that ZF and MMSE with spatial multiplexing offer a diversity order of $L_r - L_t + 1$, where $L_r$ is
the number of receive antennas, while optimum detection provides $L_r$ [3]; thus, linear detectors
are highly suboptimal. Nonlinear detectors based on the combination of linear detectors and
spatially ordered decision-feedback equalization (O-DFE) were proposed for V-BLAST in [4],
[5]; they offer some performance improvement, but suffer from noise enhancement due to nulling
and error propagation due to interference cancellation (IC). More interesting for bit interleaved
coded modulation (BICM) systems are soft-output iterative MMSE and error correction decoding
strategies, in either hard IC (HIC) [6] or soft IC (SIC) [7], [8] schemes. However, they suffer
from latency and complexity disadvantages.

To our knowledge, the class of ML-approaching algorithms is quite limited. Two important
families are the list-based detectors [9], [10], [11], [12], [13], based on the combination of ML
and DFE principles, and the lattice decoders, among which the sphere decoder (SD) [14] is the
best known.

The common idea of the list-based detectors (LD) is to divide the streams to be detected into
two groups: first, one or more reference transmit streams are selected and a corresponding list of
candidate constellation symbols is determined; then, for each sequence in the list, interference
is cancelled from the received signal and the remaining symbol estimates are determined by
as many sub-detectors operating on reduced-size sub-channels. The IC process is analogous to
V-BLAST spatial DFE; the differences lie in the criterion adopted to select the reference layer(s)
and to order the remaining ones, and in the fact that its initial symbol estimate is replaced by a
list of candidates. If the list includes all possible constellation points [9], [10], an initial stage
of interference nulling is not required; this interference nulling is still performed if a reduced-size
sorted list is generated starting from the ZF estimates [11], [12]. The final hard-decision
sequence is selected by minimizing the Euclidean distance (ED) metrics over the considered
sequences. A particularly interesting result was obtained by searching all $S$ possible cases for a
reference stream, or layer, and adopting O-DFE for the remaining $L_t - 1$ sub-detectors. If a
properly optimized layer ordering technique is utilized, numerical results reported in [10], [11]
demonstrate that the LD detector is able to achieve full receive diversity and a degradation
from ML performance of fractions of a dB. A notable property is that this can be accomplished
through a parallel implementation, as the sub-detectors can operate independently. The drawback
is that the computational complexity is high, as $L_t$ O-DFE detectors for $L_t - 1$ sub-streams have
to be computed. If efficiently implemented [12], it involves $O(L_t^4)$ complexity. Another major
shortcoming of the prior work in list based detection is the absence of an algorithm to produce 
soft bit metrics for use in coding and decoding algorithms. 

Lattice detectors use the linear nature of the MIMO channel to form a reduced-complexity ML
search. Lattice detectors are suitable for systems whose input-output relation can be represented
as a real-domain linear model

$$\mathbf{y}_r = \mathbf{B}\mathbf{x}_r + \mathbf{n}_r = \mathbf{s}_r + \mathbf{n}_r \tag{1}$$

where the information sequence $\mathbf{x}_r$ is uniformly distributed over a discrete finite set $\mathcal{C} \subset \mathbb{R}^m$,
$\mathbf{n}_r \in \mathbb{R}^n$ represents the noise vector, and $\mathbf{B}$ is an $n \times m$ real matrix, where $n = 2L_r$ and $m = 2L_t$.
The output signal vector is $\mathbf{s}_r \in \Lambda \subset \mathbb{R}^n$. $\mathbf{B}$ represents the channel mapping of the transmit signals
into the $m$-dimensional lattice $\Lambda$ and is also referred to as the lattice generator matrix. If the
noise components are independent and identically distributed (i.i.d.) zero-mean Gaussian random
variables (RVs) with a common variance, as is typical of communication systems, then the ML
decoding rule corresponds to solving the minimization problem:

$$\hat{\mathbf{x}}_r = \arg\min_{\mathbf{x}_r \in \mathcal{C}} \|\mathbf{y}_r - \mathbf{B}\mathbf{x}_r\|^2 \tag{2}$$

where $\|\cdot\|$ denotes the vector norm. The problem in (2) is a constrained version of the closest
vector problem (CVP) in lattice theory. A survey of closest point search methods has been
presented in [15]. It should be noted that in the general case the model of (1) is still valid if a
general encoder matrix $\mathbf{G} \in \mathbb{R}^{m \times m}$ is considered such that:

$$\mathbf{x}_r = \mathbf{G}\mathbf{u}_r \tag{3}$$

where $\mathbf{u}_r \in \mathcal{U} \subset \mathbb{R}^m$ is the information symbol sequence and $\mathbf{x}_r$ is the transmit codeword. In this
case the lattice generator matrix becomes $\mathbf{B}\mathbf{G}$. In the rest of the paper we will refer to (1) with
no loss of generality. The lattice formulation for MIMO wireless systems was described in [16]
for the case of quadrature amplitude modulation (QAM) digitally modulated transmitted symbols.

SD can attain ML performance with significantly reduced complexity. Some efficient variants 
of the algorithm are summarized in [17]. SD presents a number of disadvantages, which can be 
summarized as follows: 

- As the in-phase (I) and quadrature-phase (Q) components of the digitally modulated QAM
symbols are searched in a serial fashion, SD is not suitable for a parallel VLSI implementation.
In support of this claim, some papers have described the SD operations in terms
of a tree search and, recently, the equivalence between the SD and the sequential decoder has
been established rigorously [18].

- The number of lattice points to be searched is a random variable, sensitive to the channel
and noise realizations, and to the initial radius. This implies a non-deterministic complexity
and latency, which is not desirable for real-time high-data-rate applications. Further, SD complexity
has often been referred to as polynomial, but a recent work [19] shows that SD remains an
efficient solution only for problems of moderate size, and for SNR values in given ranges.
This motivated the proposal of additional optional front-end processing to expedite the lattice
search. However, lattice reduction (LR) techniques such as the Lenstra-Lenstra-Lovász (LLL) algorithm [20],
[21], [15] did not prove very useful for solving the constrained ML problem (2), as reported
in [17], because they distort the original lattice and boundary control becomes difficult.
These lattice reduction techniques are useful in conjunction with an additional processing
stage, i.e. the MMSE "generalized decision feedback equalizer" (MMSE-GDFE) [18]. When the
channel is slowly varying, this front-end processing is very effective in reducing the
overall complexity. However, if the channel changes significantly from block to block, front-end
processing can have a significant impact on the overall complexity.

- Generation of soft-output metrics is not easy with known lattice decoding algorithms. A
solution was proposed in [22] and subsequently refined in [23], where bit log-likelihood
ratios (LLRs) are computed based on a "candidate list" of sequences. No simple rule to
determine the optimal size of such a list was proposed; simulation results show that it can be
very large (thousands of lattice points). This may nullify the complexity benefits of using SD,
as also evidenced in [24]. This observation is one of the main motivations
of this work. Also, if LR techniques are used, the problem of soft-output generation
becomes prohibitively complex.

It should be noted that nulling and cancelling or equivalently ZF-DFE, besides being the core 
of the O-DFE algorithm [5], is also an important part of the SD operations. ZF-DFE can be 
efficiently implemented through a QR decomposition (QRD) of the channel matrix, as shown in 
[25], [9] and [17] for O-DFE and SD respectively. 

In an attempt to retain the advantages of the LD and SD algorithms while addressing
their main drawbacks, a novel MIMO lattice detector is proposed in this paper. This algorithm
is given the name layered orthogonal lattice detector (LORD) [26], [27].

Similarly to SD, LORD consists of three different stages. First, the system is represented
through a proper lattice formulation, different from the only one proposed so far for SD [16].
Second, an efficient preprocessing of the channel matrix is implemented for ZF-DFE. While a
standard QRD could be employed without altering the performance of the detection algorithm,
a more computationally efficient Gram-Schmidt orthogonalization (GSO) process is outlined in
this work. The last stage is the lattice search, which involves finding a proper subset of transmit
sequences to solve the problem (2). The number of lattice points to be searched is linear in
the number of transmit antennas, and the search is easily modified to provide soft-output bit metrics. The
innovative concept compared to SD, already embedded in LD [10], is that the search of the
lattice points can be accomplished in a parallel fashion, and their number is fully deterministic.
LORD achieves a large complexity reduction over the exhaustive-search ML algorithm and, as
shown via numerical results, also obtains a better complexity-performance tradeoff than SD.

The proposed GSO technique avoids computing a complete QRD multiple times if the channel 
columns are permuted, as will be clear from the sequel. Compared to the best-performing and efficiently
implemented LD ("B-Chase" detector [12]), which relies on multiple QRDs, LORD has
the following advantages. Its preprocessing is less complex, $O(L_t^3)$ for $L_t = L_r$ instead of
$O(L_t^4)$; LORD generates reliable bit soft-output information; and it does not require any particular



February 1, 2008 



DRAFT 






ordering scheme, while still retaining near-ML performance in BICM systems. Nevertheless, it
should be noted that concepts like reduced-complexity lattice search and optimal ordering
schemes can be applied to LORD as well, with proper adaptations to the real domain; the explanation
of these ideas is deferred to later works.

For two transmit antennas, LORD achieves ML hard-output demodulation and is able to 
compute optimal (max-log) bit LLRs. For more than two transmit antennas, the algorithm is 
suboptimal but still near-ML in BICM systems and its gain over MMSE-based linear and iterative 
nonlinear detectors actually increases with the dimensionality of the problem, thanks to a good 
exploitation of receive diversity. As shown later in this paper, even one single stage of LORD 
processing performs better than several iterations of MMSE-SIC in various orthogonal frequency 
division multiplexing (OFDM) BICM systems, with clear latency advantages. Iterative decoding
and LORD detection schemes represent a promising topic for future work. Overall, these
results suggest that soft-output MIMO near-ML detection, so far considered as computationally 
intractable for real-time high-data rate applications, can become a viable technique for next 
generation wireless communication systems. 

To conclude this section, it should be mentioned that no efficient soft-output ML decoding
strategy has been proposed so far for full-diversity, full-data-rate algebraic space-time codes
(STCs) like the Golden Codes (GC) [28]. The performance comparison of layered BICM systems
and uncoded GC provided in Section V shows that soft-output ML detection and error correction codes (ECCs) are
essential in order to exploit the high data rate and high link robustness promised by MIMO for
next generation wireless applications (like wireless local area networks (WLANs), undergoing
standardization as IEEE 802.11n [29]).

The rest of the paper is organized as follows. In Section II the system notation and the novel
lattice formulation are introduced. Section III is concerned with the description of the stages
of LORD for two transmit antennas, because LORD is optimal in this case. In Section III-A
an efficient preprocessing algorithm is described; Section III-B focuses on the lattice search,
and Section III-C deals with the bit soft-output generation. Section IV and its subsections include a
formulation of LORD suitable for any number of transmit antennas. Section V confirms through
numerical results that LORD provides an excellent performance-complexity tradeoff. Finally,
some concluding remarks are reported in Section VI.

II. System Model and Lattice Representation

We consider a MIMO communication system with $L_t$ transmit and $L_r$ receive antennas, and
a frequency-nonselective fading channel. We also assume the receiver has perfect knowledge of
the channel state and each receive antenna has a filter matched to the pulse shape. Then the
complex baseband received signal $\mathbf{y}_c = (Y_{c1} \ldots Y_{cL_r})^T$ is given by:

$$\mathbf{y}_c = \sqrt{\frac{E_s}{L_t}}\,\mathbf{H}_c\mathbf{x}_c + \mathbf{n}_c \tag{4}$$

where the input signal $\mathbf{x}_c = (X_{c1} \ldots X_{cL_t})^T$ is the QAM or phase shift keying (PSK)$^1$ complex
information symbol vector, $E_s$ is the energy per transmitted symbol (under the hypothesis that
the average constellation energy is $E[|X_{cj}|^2] = 1$), $\mathbf{n}_c = (N_{c1} \ldots N_{cL_r})^T$ is the $L_r \times 1$ complex
additive white Gaussian noise (AWGN) sample vector, and $\mathbf{H}_c$ is the $L_r \times L_t$ complex channel matrix. The
entries of $\mathbf{H}_c$ are the i.i.d. complex path gains $H_{cji} \sim \mathcal{N}_c(0, 1)$ from transmit antenna $i$ to receive
antenna $j$. At receive antenna $j$, the corrupting i.i.d. noise samples are $N_{cj} \sim \mathcal{N}_c(0, N_0)$. As
it will prove useful in the following, the $i$-th column of $\mathbf{H}_c$ is denoted as $\mathbf{h}_{ci}$. Equation (4) is
assumed to be valid for each OFDM tone if a MIMO-OFDM system and frequency-selective
channels are considered.

This paper assumes QAM modulation and derives a real lattice formulation. As a variant of the
traditional lattice formulation [16], the system (4) can be translated into the form (1) by performing
appropriate scaling and by ordering the I and Q components of the complex entries as follows:

$$\mathbf{x}_r = [X_{1,I}, X_{1,Q}, \ldots, X_{L_t,I}, X_{L_t,Q}]^T = [x_1, \ldots, x_{2L_t}]^T \tag{5}$$

$$\mathbf{y}_r = [Y_{1,I}, Y_{1,Q}, \ldots, Y_{L_r,I}, Y_{L_r,Q}]^T \tag{6}$$

$$\mathbf{n}_r = [N_{1,I}, N_{1,Q}, \ldots, N_{L_r,I}, N_{L_r,Q}]^T. \tag{7}$$

Then (4) can be re-written as:

$$\mathbf{y}_r = \sqrt{\frac{E_s}{L_t}}\,\mathbf{H}_r\mathbf{x}_r + \mathbf{n}_r = \sqrt{\frac{E_s}{L_t}}\,[\mathbf{h}_1, \ldots, \mathbf{h}_{2L_t}]\,\mathbf{x}_r + \mathbf{n}_r. \tag{8}$$

$^1$ It should be noted that a single version of SD cannot handle both QAM and PSK transmit symbols; a modified SD [22]
has been proposed for the latter case. LORD can be easily adapted to handle PSK modulations and requires only a minor
modification in the demodulation section.

Each pair of columns $(\mathbf{h}_{2k-1}, \mathbf{h}_{2k})$, $k = \{1, \ldots, L_t\}$, of the real channel matrix $\mathbf{H}_r$ has the form:

$$\mathbf{h}_{2k-1} = [\Re[H_{1k}], \Im[H_{1k}], \ldots, \Re[H_{L_r k}], \Im[H_{L_r k}]]^T \tag{9}$$

$$\mathbf{h}_{2k} = [-\Im[H_{1k}], \Re[H_{1k}], \ldots, -\Im[H_{L_r k}], \Re[H_{L_r k}]]^T \tag{10}$$

As a direct consequence of this formulation, they are pairwise orthogonal, i.e.

$$\mathbf{h}_{2k-1}^T\mathbf{h}_{2k} = 0$$

where $k = \{1, \ldots, L_t\}$. This property will prove to be essential for the simplified LORD demodulation.
Other useful relations are:

$$\|\mathbf{h}_{2k-1}\|^2 = \|\mathbf{h}_{2k}\|^2, \qquad \mathbf{h}_{2k-1}^T\mathbf{h}_{2j-1} = \mathbf{h}_{2k}^T\mathbf{h}_{2j}, \qquad \mathbf{h}_{2k-1}^T\mathbf{h}_{2j} = -\mathbf{h}_{2k}^T\mathbf{h}_{2j-1} \tag{11}$$

where $k, j = \{1, \ldots, L_t\}$ and $k \neq j$.
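
The column structure in (9)-(10) and the relations in (11) can be checked numerically. The following is a minimal sketch in plain Python; the helper names `real_columns` and `dot`, the antenna counts and the random channel are illustrative assumptions, not part of the paper.

```python
import random

def real_columns(Hc):
    """Return the 2*Lt real columns h_1..h_{2Lt} built as in (9)-(10).

    Hc is an Lr x Lt complex channel matrix given as a list of rows.
    """
    cols = []
    for k in range(len(Hc[0])):
        odd, even = [], []
        for row in Hc:
            odd += [row[k].real, row[k].imag]      # entries of h_{2k-1}
            even += [-row[k].imag, row[k].real]    # entries of h_{2k}
        cols += [odd, even]
    return cols

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumed example: random complex gains for Lr = 3, Lt = 2.
random.seed(0)
Lr, Lt = 3, 2
Hc = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(Lt)]
      for _ in range(Lr)]
h1, h2, h3, h4 = real_columns(Hc)

print(dot(h1, h2))                  # pairwise orthogonality of (h1, h2)
print(dot(h1, h1) - dot(h2, h2))    # equal norms within a pair
print(dot(h1, h3) - dot(h2, h4))    # first cross relation in (11)
print(dot(h1, h4) + dot(h2, h3))    # second cross relation in (11)
```

Each printed value should be zero up to floating-point rounding, for any channel realization.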

III. LORD Algorithm - case of two transmit antennas 

This section is concerned with the derivation of the LORD algorithm for the case of $L_t = 2$
transmit antennas. The two-transmit-antenna case is treated separately because in this case
LORD is optimal. After the system is represented in the real domain through the novel I and Q
ordering, the proposed lattice detection algorithm requires two additional stages: preprocessing
and lattice search. The purpose of preprocessing is to turn the MIMO channel into an upper
triangular system. The proposed transformation is a computationally efficient alternative to a QRD,
as the normalizations are performed after the channel orthogonalization is completed, although a
standard QRD would not impair the demodulation and performance properties of the algorithm.

A. The preprocessing algorithm 

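The channel matrix can be represented as

$$\mathbf{H}_r = \mathbf{Q}\mathbf{R}\boldsymbol{\Lambda}_q \tag{12}$$

for $L_r \geq 2$. To show this, note that a $2L_r \times 2L_t$ matrix with orthogonal columns can be defined as

$$\mathbf{Q} = [\,\mathbf{h}_1 \;\; \mathbf{h}_2 \;\; \mathbf{q}_3 \;\; \mathbf{q}_4\,] \tag{13}$$

where

$$\mathbf{q}_3 = \|\mathbf{h}_1\|^2\mathbf{h}_3 - (\mathbf{h}_1^T\mathbf{h}_3)\mathbf{h}_1 - (\mathbf{h}_2^T\mathbf{h}_3)\mathbf{h}_2 \tag{14}$$

$$\mathbf{q}_4 = \|\mathbf{h}_1\|^2\mathbf{h}_4 - (\mathbf{h}_1^T\mathbf{h}_4)\mathbf{h}_1 - (\mathbf{h}_2^T\mathbf{h}_4)\mathbf{h}_2. \tag{15}$$

Then, remembering (11), one has:

$$\mathbf{Q}^T\mathbf{Q} = \mathrm{diag}\left[\|\mathbf{h}_1\|^2, \|\mathbf{h}_1\|^2, \|\mathbf{q}_3\|^2, \|\mathbf{q}_3\|^2\right]. \tag{16}$$

It can also be written that

$$\|\mathbf{q}_3\|^2 = \|\mathbf{h}_1\|^2\left(\|\mathbf{h}_3\|^2\|\mathbf{h}_1\|^2 - (\mathbf{h}_1^T\mathbf{h}_3)^2 - (\mathbf{h}_2^T\mathbf{h}_3)^2\right) = \|\mathbf{h}_1\|^2 r_3 \tag{17}$$

where by definition

$$r_3 = \|\mathbf{h}_3\|^2\|\mathbf{h}_1\|^2 - (\mathbf{h}_1^T\mathbf{h}_3)^2 - (\mathbf{h}_2^T\mathbf{h}_3)^2. \tag{18}$$

By defining the $2L_t \times 2L_t$ upper triangular matrix

$$\mathbf{R} = \begin{bmatrix} 1 & 0 & \mathbf{h}_1^T\mathbf{h}_3 & \mathbf{h}_1^T\mathbf{h}_4 \\ 0 & 1 & \mathbf{h}_2^T\mathbf{h}_3 & \mathbf{h}_2^T\mathbf{h}_4 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{19}$$

and the $2L_t \times 2L_t$ diagonal matrix

$$\boldsymbol{\Lambda}_q = \mathrm{diag}\left[1, 1, \|\mathbf{h}_1\|^{-2}, \|\mathbf{h}_1\|^{-2}\right] \tag{20}$$

the original real channel matrix can be decomposed as

$$\mathbf{H}_r = \mathbf{Q}\mathbf{R}\boldsymbol{\Lambda}_q. \tag{21}$$

The linear preprocessing proposed in this paper is given as

$$\tilde{\mathbf{y}}_r = \mathbf{Q}^T\mathbf{y}_r \tag{22}$$

where all values of $\mathbf{Q}$ are simple functions of the known channel coefficients. The signal model
after preprocessing is given as

$$\tilde{\mathbf{y}}_r = \sqrt{\frac{E_s}{L_t}}\,\tilde{\mathbf{R}}\mathbf{x}_r + \mathbf{Q}^T\mathbf{n}_r = \sqrt{\frac{E_s}{L_t}}\,\tilde{\mathbf{R}}\mathbf{x}_r + \tilde{\mathbf{n}}_r \tag{23}$$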
where

$$\tilde{\mathbf{R}} = \mathbf{Q}^T\mathbf{Q}\mathbf{R}\boldsymbol{\Lambda}_q = \begin{bmatrix} \|\mathbf{h}_1\|^2 & 0 & \mathbf{h}_1^T\mathbf{h}_3 & \mathbf{h}_1^T\mathbf{h}_4 \\ 0 & \|\mathbf{h}_1\|^2 & \mathbf{h}_2^T\mathbf{h}_3 & \mathbf{h}_2^T\mathbf{h}_4 \\ 0 & 0 & r_3 & 0 \\ 0 & 0 & 0 & r_3 \end{bmatrix} \tag{24}$$

is in the desired upper triangular form for lattice demodulation algorithms. System (23) is the
real-domain lattice system equation that the LORD detection algorithm uses, and is in the form of (1).
The noise vector in the triangular model still has independent components, but the components
have unequal variances, i.e.,

$$E[\tilde{\mathbf{n}}_r\tilde{\mathbf{n}}_r^T] = \frac{N_0}{2}\,\mathrm{diag}\left[\|\mathbf{h}_1\|^2, \|\mathbf{h}_1\|^2, \|\mathbf{h}_1\|^2 r_3, \|\mathbf{h}_1\|^2 r_3\right]. \tag{25}$$

The advantageous characteristic of the model formulation is that $\tilde{R}_{1,2} = \tilde{R}_{3,4} = 0$, i.e., the
I and Q components of each transmitted signal are broken into orthogonal dimensions and
can be searched in an independent fashion.
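
The covariance in (25) follows in one step from (16) and (17). The short derivation below is a sketch that assumes $E[\mathbf{n}_r\mathbf{n}_r^T] = (N_0/2)\,\mathbf{I}$, i.e. that the I and Q parts of each complex noise sample are independent with variance $N_0/2$ each (consistent with $\sigma^2 = N_0/2$ used later for the likelihood function):

```latex
\begin{align*}
E[\tilde{\mathbf{n}}_r \tilde{\mathbf{n}}_r^T]
  &= \mathbf{Q}^T\, E[\mathbf{n}_r \mathbf{n}_r^T]\, \mathbf{Q}
   = \frac{N_0}{2}\, \mathbf{Q}^T \mathbf{Q} \\
  &= \frac{N_0}{2}\, \mathrm{diag}\!\left[\|\mathbf{h}_1\|^2,\ \|\mathbf{h}_1\|^2,\
     \|\mathbf{h}_1\|^2 r_3,\ \|\mathbf{h}_1\|^2 r_3\right].
\end{align*}
```

Because $\mathbf{Q}^T\mathbf{Q}$ is diagonal, the transformed noise components stay uncorrelated; only their variances differ, which is why the decision metric weights each squared term by the inverse of the corresponding diagonal entry.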

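As a further observation, all parameters needed in this triangularized model are functions of
eight variables. Four of the variables are functions of the channel only, i.e.,

$$\sigma_1^2 = \|\mathbf{h}_1\|^2 \qquad \sigma_2^2 = \|\mathbf{h}_3\|^2 \qquad s_1 = \mathbf{h}_1^T\mathbf{h}_3 \qquad s_2 = \mathbf{h}_1^T\mathbf{h}_4 \tag{26}$$

and four are functions of the channel and the observations, i.e.,

$$V_1 = \mathbf{h}_1^T\mathbf{y}_r \qquad V_2 = \mathbf{h}_2^T\mathbf{y}_r \qquad V_3 = \mathbf{h}_3^T\mathbf{y}_r \qquad V_4 = \mathbf{h}_4^T\mathbf{y}_r. \tag{27}$$

Also, equalities (11) imply that the 2x2 matrix in the upper right corner of $\tilde{\mathbf{R}}$ is a (scaled) rotation
matrix. Specifically, the required results for the upper triangular formulation are

$$\tilde{\mathbf{y}}_r = \begin{bmatrix} V_1 \\ V_2 \\ \sigma_1^2 V_3 - s_1 V_1 + s_2 V_2 \\ \sigma_1^2 V_4 - s_2 V_1 - s_1 V_2 \end{bmatrix} \qquad \tilde{\mathbf{R}} = \begin{bmatrix} \sigma_1^2 & 0 & s_1 & s_2 \\ 0 & \sigma_1^2 & -s_2 & s_1 \\ 0 & 0 & \sigma_1^2\sigma_2^2 - s_1^2 - s_2^2 & 0 \\ 0 & 0 & 0 & \sigma_1^2\sigma_2^2 - s_1^2 - s_2^2 \end{bmatrix} \tag{28}$$

This formulation results in a preprocessing complexity (expressed in terms of real multiplications,
RMs) that is $O(16L_r + 9)$.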
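
As an illustration of how (26)-(28) fit together, the sketch below computes the preprocessed observation and the triangular matrix from the eight scalars only, then checks the noiseless identity $\tilde{\mathbf{y}}_r = \tilde{\mathbf{R}}\mathbf{x}_r$. The function name `lord_preprocess`, the 2x2 random channel and the symbol vector are assumptions made for the example, and the factor $\sqrt{E_s/L_t}$ is taken as absorbed into the channel.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def lord_preprocess(h1, h2, h3, h4, y):
    """Return (y_tilde, R_tilde) of (28) from the eight scalars in (26)-(27)."""
    sig1_sq, sig2_sq = dot(h1, h1), dot(h3, h3)
    s1, s2 = dot(h1, h3), dot(h1, h4)
    v1, v2, v3, v4 = (dot(h, y) for h in (h1, h2, h3, h4))
    r3 = sig1_sq * sig2_sq - s1 ** 2 - s2 ** 2
    y_tilde = [v1, v2,
               sig1_sq * v3 - s1 * v1 + s2 * v2,
               sig1_sq * v4 - s2 * v1 - s1 * v2]
    r_tilde = [[sig1_sq, 0.0, s1, s2],
               [0.0, sig1_sq, -s2, s1],
               [0.0, 0.0, r3, 0.0],
               [0.0, 0.0, 0.0, r3]]
    return y_tilde, r_tilde

# Assumed example: random 2x2 complex gains and a noiseless observation.
random.seed(1)
gains = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2)]
         for _ in range(2)]
cols = []
for k in range(2):
    cols.append([p for row in gains for p in (row[k].real, row[k].imag)])    # h_{2k-1}
    cols.append([p for row in gains for p in (-row[k].imag, row[k].real)])   # h_{2k}
x = [1, -3, 3, -1]                                    # (x1, x2, x3, x4)
y = [sum(cols[i][j] * x[i] for i in range(4)) for j in range(4)]

y_tilde, r_tilde = lord_preprocess(*cols, y)
residual = [y_tilde[i] - dot(r_tilde[i], x) for i in range(4)]
print(residual)   # all entries vanish up to rounding in the noiseless case
```

Note that no square roots or divisions appear in this stage, which reflects the remark that normalizations are deferred until after the orthogonalization.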
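As it will prove useful when dealing with soft-output generation, we also notice that shifting
the ordering of the transmit antennas results in a similar model:

$$\tilde{\mathbf{y}}_s = \sqrt{\frac{E_s}{L_t}}\,\tilde{\mathbf{R}}_s\mathbf{x}_s + \tilde{\mathbf{n}}_s \tag{29}$$

where

$$\tilde{\mathbf{y}}_s = \begin{bmatrix} \tilde{y}_{s1} \\ \tilde{y}_{s2} \\ \tilde{y}_{s3} \\ \tilde{y}_{s4} \end{bmatrix} = \begin{bmatrix} V_3 \\ V_4 \\ \sigma_2^2 V_1 - s_1 V_3 - s_2 V_4 \\ \sigma_2^2 V_2 + s_2 V_3 - s_1 V_4 \end{bmatrix} \qquad \tilde{\mathbf{R}}_s = \begin{bmatrix} \sigma_2^2 & 0 & s_1 & -s_2 \\ 0 & \sigma_2^2 & s_2 & s_1 \\ 0 & 0 & \sigma_1^2\sigma_2^2 - s_1^2 - s_2^2 & 0 \\ 0 & 0 & 0 & \sigma_1^2\sigma_2^2 - s_1^2 - s_2^2 \end{bmatrix} \tag{30}$$

$$E[\tilde{\mathbf{n}}_s\tilde{\mathbf{n}}_s^T] = \frac{N_0}{2}\,\mathrm{diag}\left[\|\mathbf{h}_3\|^2, \|\mathbf{h}_3\|^2, \|\mathbf{h}_3\|^2 r_3, \|\mathbf{h}_3\|^2 r_3\right] \tag{31}$$

and $\mathbf{x}_s = [x_3 \; x_4 \; x_1 \; x_2]^T$.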

Finally, we observe that there is an interesting relationship between the triangularized model
parameters and the complex channel coefficients. First it should be noted that $\sigma_1^2 = \|\mathbf{h}_{c1}\|^2$ and
that $\sigma_2^2 = \|\mathbf{h}_{c2}\|^2$. Secondly, the sample cross-correlation between the gains for transmit antenna
1 and transmit antenna 2 is given as

$$\mathbf{h}_{c2}^H\mathbf{h}_{c1} = s_1 + j s_2. \tag{32}$$

A sample cross-correlation coefficient can be defined as

$$\rho_{12} = \frac{\mathbf{h}_{c2}^H\mathbf{h}_{c1}}{\sigma_1\sigma_2}. \tag{33}$$

Using (33), formula (18) can be written as

$$r_3 = \sigma_1^2\sigma_2^2\left(1 - |\rho_{12}|^2\right). \tag{34}$$

It is apparent that when $L_r$ gets large the magnitude of $\rho_{12}$ will go to zero and the MIMO
detection problem for each antenna will become completely decoupled.
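
The equivalence between (18) and (34) can be verified directly from the complex channel columns. This is a small sketch under assumed values (a random channel with four receive antennas); the variable names are illustrative only.

```python
import math
import random

# Assumed example: two random complex channel columns with Lr = 4.
random.seed(2)
Lr = 4
hc1 = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(Lr)]
hc2 = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(Lr)]

sig1_sq = sum(abs(g) ** 2 for g in hc1)                      # sigma_1^2 = ||h_c1||^2
sig2_sq = sum(abs(g) ** 2 for g in hc2)                      # sigma_2^2 = ||h_c2||^2
cross = sum(b.conjugate() * a for a, b in zip(hc1, hc2))     # h_c2^H h_c1, as in (32)
s1, s2 = cross.real, cross.imag
rho12 = cross / math.sqrt(sig1_sq * sig2_sq)                 # coefficient of (33)

r3_from_18 = sig1_sq * sig2_sq - s1 ** 2 - s2 ** 2
r3_from_34 = sig1_sq * sig2_sq * (1 - abs(rho12) ** 2)
print(r3_from_18, r3_from_34)   # the two values agree up to rounding
```

Since $|\rho_{12}| \le 1$ by the Cauchy-Schwarz inequality, $r_3$ is non-negative, and it is strictly positive whenever the two channel columns are not collinear.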
B. Lattice search and demodulation

The system equations defined in Section III-A lead naturally to a simplified yet optimal ML
demodulation. Consider a PSK or QAM constellation of size $S$. The discussion in this paper
will assume that $M^2$-QAM modulation is used on each antenna, but a generalization to any
linear modulation is possible. The optimum ML word demodulator (2) would have to compute
the ML metric for $M^{2L_t}$ constellation points and has a complexity $O(M^4)$ for $L_t = 2$.$^2$

$^2$ This statement applies to the "exhaustive-search" ML demodulator; a triangular decomposition of the channel matrix in itself,
as in a standard QRD, would be enough to lower the complexity of the search to $O(M^{2L_t - 1})$.

The notation used in the sequel is that $\Omega_x$ will refer to the $M$-PAM constellation for each
real dimension. Given the formulation in (23)-(28) and neglecting scalar energy normalization
factors to simplify the notation, the ML decision metric becomes

$$T(\mathbf{x}_r) = \|\tilde{\mathbf{y}}_r - \tilde{\mathbf{R}}\mathbf{x}_r\|^2 = \frac{(\tilde{y}_1 - \sigma_1^2 x_1 - s_1 x_3 - s_2 x_4)^2}{\sigma_1^2} + \frac{(\tilde{y}_2 - \sigma_1^2 x_2 + s_2 x_3 - s_1 x_4)^2}{\sigma_1^2} + \frac{(\tilde{y}_3 - r_3 x_3)^2}{\sigma_1^2 r_3} + \frac{(\tilde{y}_4 - r_3 x_4)^2}{\sigma_1^2 r_3} \tag{35}$$

The ML demodulator finds the minimum value of the metric over all possible values of the
sequence $\mathbf{x}_r$. This search can be greatly simplified by noting that, for given values of $x_3$ and $x_4$, the
maximum likelihood metric reduces to

$$T(\mathbf{x}_r) = \frac{(\tilde{y}_1 - \sigma_1^2 x_1 - C_1(x_3, x_4))^2}{\sigma_1^2} + \frac{(\tilde{y}_2 - \sigma_1^2 x_2 - C_2(x_3, x_4))^2}{\sigma_1^2} + C_3(x_3, x_4) \tag{36}$$

where

$$C_1(x_3, x_4) = s_1 x_3 + s_2 x_4 \qquad C_2(x_3, x_4) = -s_2 x_3 + s_1 x_4 \qquad C_3(x_3, x_4) \geq 0. \tag{37}$$

The originality of LORD stems from the fact that, as is clear from (36), the conditional ML
decision on $x_1$ and $x_2$ can immediately be made by a simple threshold test, i.e.,

$$\hat{x}_1(x_3, x_4) = \mathrm{round}\left(\frac{\tilde{y}_1 - C_1(x_3, x_4)}{\sigma_1^2}\right) \qquad \hat{x}_2(x_3, x_4) = \mathrm{round}\left(\frac{\tilde{y}_2 - C_2(x_3, x_4)}{\sigma_1^2}\right) \tag{38}$$

where the round operation is a simple slicing operation to the constellation elements of $\Omega_x$. This
property is a direct consequence of the orthogonality of the problem formulation. The final ML
estimate is then given as

$$\{\hat{x}_1(\hat{x}_3, \hat{x}_4), \hat{x}_2(\hat{x}_3, \hat{x}_4), \hat{x}_3, \hat{x}_4\} = \arg\min_{(x_3, x_4) \in \Omega_x^2}\left\{\frac{(\tilde{y}_1 - \sigma_1^2\hat{x}_1(x_3, x_4) - C_1(x_3, x_4))^2}{\sigma_1^2} + \frac{(\tilde{y}_2 - \sigma_1^2\hat{x}_2(x_3, x_4) - C_2(x_3, x_4))^2}{\sigma_1^2} + C_3(x_3, x_4)\right\} \tag{39}$$

This implies that the number of points that have to be searched in this formulation to find the
true ML estimate is $M^2$ (with two slicing operations per searched point). This is a significant
saving in complexity.
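
The search in (36)-(39) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the names `lord_hard_detect` and `slice_pam`, the 4-PAM alphabet, the random 2x2 channel and the noise level are assumptions, and $\sqrt{E_s/L_t}$ is taken as absorbed into the channel.

```python
import itertools
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def slice_pam(value, pam):
    """Round to the nearest M-PAM level (the slicing operation in (38))."""
    return min(pam, key=lambda level: abs(level - value))

def lord_hard_detect(cols, y, pam):
    """Hard-output search over the M^2 values of (x3, x4), as in (39)."""
    h1, h2, h3, h4 = cols
    sig1, sig2 = dot(h1, h1), dot(h3, h3)
    s1, s2 = dot(h1, h3), dot(h1, h4)
    r3 = sig1 * sig2 - s1 ** 2 - s2 ** 2
    v1, v2, v3, v4 = (dot(h, y) for h in cols)
    yt = [v1, v2, sig1 * v3 - s1 * v1 + s2 * v2, sig1 * v4 - s2 * v1 - s1 * v2]
    best_metric, best_x = float("inf"), None
    for x3, x4 in itertools.product(pam, repeat=2):
        c1 = s1 * x3 + s2 * x4
        c2 = -s2 * x3 + s1 * x4
        c3 = ((yt[2] - r3 * x3) ** 2 + (yt[3] - r3 * x4) ** 2) / (sig1 * r3)
        x1 = slice_pam((yt[0] - c1) / sig1, pam)     # conditional decision on x1
        x2 = slice_pam((yt[1] - c2) / sig1, pam)     # conditional decision on x2
        metric = ((yt[0] - sig1 * x1 - c1) ** 2
                  + (yt[1] - sig1 * x2 - c2) ** 2) / sig1 + c3
        if metric < best_metric:
            best_metric, best_x = metric, (x1, x2, x3, x4)
    return best_x

# Assumed example: 16-QAM per antenna (4-PAM per real dimension), 2x2 channel.
random.seed(3)
PAM4 = (-3, -1, 1, 3)
gains = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2)]
         for _ in range(2)]
cols = []
for k in range(2):
    cols.append([p for row in gains for p in (row[k].real, row[k].imag)])
    cols.append([p for row in gains for p in (-row[k].imag, row[k].real)])
x_true = [random.choice(PAM4) for _ in range(4)]
y = [sum(cols[i][j] * x_true[i] for i in range(4)) + random.gauss(0, 0.5)
     for j in range(4)]
x_hat = lord_hard_detect(cols, y, PAM4)
print(x_true, x_hat)
```

Each iteration of the loop is independent of the others, so the $M^2$ candidates could be evaluated in parallel; the sequential loop here is only for readability.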

It should be noticed that, in direct analogy to (29)-(30), the ML estimate could as well be
found by minimizing the reordered ML decision metric

$$T'(\mathbf{x}_s) = \|\tilde{\mathbf{y}}_s - \tilde{\mathbf{R}}_s\mathbf{x}_s\|^2 = \frac{(\tilde{y}_{s1} - \sigma_2^2 x_3 - s_1 x_1 + s_2 x_2)^2}{\sigma_2^2} + \frac{(\tilde{y}_{s2} - \sigma_2^2 x_4 - s_2 x_1 - s_1 x_2)^2}{\sigma_2^2} + \frac{(\tilde{y}_{s3} - r_3 x_1)^2}{\sigma_2^2 r_3} + \frac{(\tilde{y}_{s4} - r_3 x_2)^2}{\sigma_2^2 r_3}. \tag{40}$$

Similarly to (38)-(39), minimization of (40) can be accomplished by considering all possible $M^2$
values for $(x_1, x_2)$ and obtaining the corresponding $(\hat{x}_3(x_1, x_2), \hat{x}_4(x_1, x_2))$ through rounding
operations to the constellation elements of $\Omega_x$.

We observe that this reduced-complexity ML demodulation is a direct consequence of the
reordered lattice formulation. Each group of two rows in the model (28) corresponds to a transmit
antenna, or layer (the two terms will be used interchangeably in the remainder of the paper).
Equation (38) shows that the decisions for the top layer can be made independently for the I
and the Q modulation. If the traditional lattice formulation [16] is adopted instead, in (35) the
partial ED (PED) terms corresponding to the higher rows of the triangularized model become
dependent on all the lower layers of the transmit modulation, and the simplified demodulation
(39) is no longer possible.

Two further observations conclude this section:

- The search of the lattice points can be carried out in a completely parallel fashion.
This solves one of the drawbacks of the SD algorithm, characterized by a recursive, i.e.
serial, search, and is desirable for VLSI implementations.

- The lattice point enumeration technique, i.e. the method of spanning the points during the
search, is not important for LORD as long as all $M^2$ possible cases for the bottom
layer are searched. However, we observe that ordering the candidate list according
to an increasing ED from the receiver observations has important implications for
suboptimal searches, i.e. if fewer than $M^2$ values for the bottom layer are considered.
This corresponds to the Schnorr-Euchner (SE) [21], [14] enumeration method. Future
work will address this important sub-optimal and reduced-complexity version of LORD.



C. LLR generation 

This section deals with the reduced-complexity generation of reliable soft-output information.
This problem is often neglected in the lattice decoding literature because of the intrinsic difficulties
caused by the SD attempt to minimize the number of searched lattice points. As
mentioned in Section I, a partial solution to this issue has been proposed in [22], [23] with the
introduction of the so-called "candidate list". Unfortunately, the random nature of the selected
points to be stored in this list poses several implementation and complexity issues, also evidenced
in [24]. To name a few: no rule to optimally size the list has been proposed, and simulation results
show that in order to obtain reliable LLRs the size depends on the considered MIMO scenario;
points are stored in the list in an inherently sequential manner; and the "quality" of the points
stored in the list, as well as the total number of searched sequences before the search can be
declared concluded, strongly depends on the choice of the sphere radius. The use of LR techniques
can only help in making the convergence to the ML solution faster, but precludes the detection
algorithm from computing soft-output values, because the boundaries of the information set are
no longer recognizable after the application of such techniques. The choice followed in this work
was therefore to avoid LR methods and to address the non-deterministic and sequential nature of the
selection of the sequences needed for the generation of reliable bit soft-output information.

The problem is first recalled for the complex-domain system (4). If an $M^2$-QAM constellation is
considered for the information symbol vector and $M_c$ is the number of bits per symbol, the LLR
or logarithmic a-posteriori probability (APP) ratio of the bit $b_k$, $k = 1, \ldots, 2M_c$, conditioned on
the received channel symbol vector $\mathbf{y}_c$, is often expressed as:

$$L(b_k|\mathbf{y}_c) = \ln\frac{P(b_k = 1|\mathbf{y}_c)}{P(b_k = 0|\mathbf{y}_c)} = \ln\frac{\sum_{\mathbf{x}_c \in S(k)^+} P(\mathbf{y}_c|\mathbf{x}_c)P_a(\mathbf{x}_c)}{\sum_{\mathbf{x}_c \in S(k)^-} P(\mathbf{y}_c|\mathbf{x}_c)P_a(\mathbf{x}_c)} \tag{41}$$

In (41), $S(k)^+$ ($S(k)^-$) is the set of $2^{2M_c - 1}$ bit sequences having $b_k = 1$ ($b_k = 0$); $P_a(\mathbf{x}_c)$
represents the a-priori probabilities of $\mathbf{x}_c$ and will be neglected in the rest of this paper as
equiprobable transmit symbols are considered. From (4), the likelihood function $P(\mathbf{y}_c|\mathbf{x}_c)$ is
given by:

$$P(\mathbf{y}_c|\mathbf{x}_c) \propto \exp\left[-\frac{1}{2\sigma^2}\left\|\mathbf{y}_c - \sqrt{\frac{E_s}{L_t}}\,\mathbf{H}_c\mathbf{x}_c\right\|^2\right] = \exp\left[-D(\mathbf{x}_c)\right] \tag{42}$$

where $\sigma^2 = N_0/2$ and $D(\mathbf{x}_c)$ is the ED term. The summation of exponentials involved in (41)
can be approximated according to the following so-called max-log approximation:

$$\ln\sum_{\mathbf{x} \in S(k)^+}\exp\left[-D(\mathbf{x})\right] \approx \ln\max_{\mathbf{x} \in S(k)^+}\exp\left[-D(\mathbf{x})\right] = -\min_{\mathbf{x} \in S(k)^+} D(\mathbf{x}) \tag{43}$$

Expression (43) is equivalent to neglecting a correction term in the exact log-domain version of
(41), which uses the "Jacobian logarithm" or max* function

$$\mathrm{jacln}(a, b) := \ln\left[\exp(a) + \exp(b)\right] = \max(a, b) + \ln\left[1 + \exp(-|a - b|)\right]. \tag{44}$$

As shown e.g. in [30], the performance degradation caused by the max-log approximation is
generally very small compared to the use of the max* function. Using (43) in (41), max-log bit
LLRs can then be written as:

$$L(b_k|\mathbf{y}_c) \approx \min_{\mathbf{x}_c \in S(k)^-} D(\mathbf{x}_c) - \min_{\mathbf{x}_c \in S(k)^+} D(\mathbf{x}_c) \tag{45}$$

Expression (45) involves two minimization problems, i.e. for each bit index $k = 1, \ldots, 2M_c$ it
requires identification of the most likely transmit sequence (or lattice point) where $b_k = 1$ and
the most likely one where $b_k = 0$. By definition, one of the two sequences is the hard-decision
ML solution of (2). However, using SD, there is no guarantee that the other sequence is found
during the lattice search. LORD does not have this problem, as shown in the sequel.
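
To make the difference between (41), (44) and (45) concrete, the sketch below evaluates an exact LLR through repeated use of the Jacobian logarithm and compares it with the max-log value. The function names and the ED values are hypothetical and chosen only for illustration; a-priori probabilities are ignored, as in the text.

```python
import math

def jacln(a, b):
    """Jacobian logarithm (max* function) of (44)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def exact_llr(d_one, d_zero):
    """Exact LLR from the ED terms of the sequences with b_k = 1 and b_k = 0."""
    num = -d_one[0]
    for d in d_one[1:]:
        num = jacln(num, -d)       # log of the sum of exp(-D) over S(k)+
    den = -d_zero[0]
    for d in d_zero[1:]:
        den = jacln(den, -d)       # log of the sum of exp(-D) over S(k)-
    return num - den

def max_log_llr(d_one, d_zero):
    """Max-log LLR of (45): only the smallest ED term of each set is kept."""
    return min(d_zero) - min(d_one)

# Hypothetical ED terms for the two bit hypotheses.
d_one, d_zero = [0.5, 4.0, 6.0], [2.5, 3.0, 7.0]
print(exact_llr(d_one, d_zero), max_log_llr(d_one, d_zero))
```

The max-log value depends only on the best candidate in each set, which is why a detector must be guaranteed to visit both of those candidates for every bit.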

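The formulation of the problem in the case of real-domain lattice equations is entirely analogous.
From (23), the LLRs assume the form:

$$L(b_k|\tilde{\mathbf{y}}_r) = \ln\frac{\sum_{\mathbf{x}_r \in S(k)^+} P(\tilde{\mathbf{y}}_r|\mathbf{x}_r)}{\sum_{\mathbf{x}_r \in S(k)^-} P(\tilde{\mathbf{y}}_r|\mathbf{x}_r)} \tag{46}$$

where, recalling (35), the likelihood function $P(\tilde{\mathbf{y}}_r|\mathbf{x}_r)$ is given by:

$$P(\tilde{\mathbf{y}}_r|\mathbf{x}_r) \propto \exp\left[-T(\mathbf{x}_r)\right]. \tag{47}$$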
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Let us first focus on the bits corresponding to the complex symbol X c2 in the symbol sequence 
x c = (X c i X c2 ) T . By employing arguments similar to those that led to the simplified ML 
estimator d39l) . it can be easily proven that the two ED terms needed for every bit in X c2 are 
certainly found computing (l35t over the possible M 2 values of X c2 = (x 3 , x 4 ) and minimizing 
the expression over X cl = (xx, x 2 ), for every value of X c2 . This last operation is simply carried 
out through the slicing operation to the constellation elements of £l x , described in (l38l) . The 
LLRs relative to the bits corresponding to X 2 , b 2< k, can then be written as: 

L(M50 « min T (*r) - min T(x r ) (48) 

where k — 1, . . . , M c , and S(k) 2 (S(k) 2 ) are the set of 2 Mc ~ 1 bit sequences having h 2 ^ = 1 
(62,* = 0). 

The computation of the LLRs for the bits corresponding to symbol $X_1$ can be obtained by
a simple reordering of the model and by repeating the LORD processing, as for (29)-(30).
Recalling that $\mathbf{x}_s = [x_3\ x_4\ x_1\ x_2]$ is the reordered information sequence, using (40) the LLRs of
the bits corresponding to $X_1$, $b_{1,k}$, can be written as

$$L(b_{1,k}|\mathbf{y}_s) \approx \min_{x_1, x_2 \in S(k)_1^-} T'(\mathbf{x}_s) - \min_{x_1, x_2 \in S(k)_1^+} T'(\mathbf{x}_s) \qquad (49)$$

where $k = 1, \ldots, M_c$, and $S(k)_1^+$ ($S(k)_1^-$) are the sets of $2^{M_c-1}$ bit sequences having $b_{1,k} = 1$
($b_{1,k} = 0$). A significant complexity reduction is possible when forming the LLRs.
By comparing (28) and (30), it is apparent that much of the preprocessing computation needed
in (28) can be reused in the reordered model (30). The resulting complexity of the preprocessing stage
will be $O(16L_r + 12)$. The lattice search for both orderings will have complexity $O(2M^2)$ due
to the max-log LLR computation.

IV. LORD Algorithm - case of $L_t$ transmit antennas

The LORD detection algorithm can be generalized to any $L_t > 2$ and $L_r \ge L_t$ in a suboptimal
way that nevertheless often remains near-ML, as shown in Section V. Specifically, a computationally
efficient QRD algorithm is described in Section IV-A. A notationally compact and elegant
recursive variant of QRD is given in the Appendix. The relation between the extended and the
compact representations is analogous to that existing between QRD through GSO and modified
GSO (MGSO) [31]. The main difference between the GSO proposed in this paper and the QRD
[31] lies in the way the normalizations are handled, as will be clear from the sequel. The
lattice search and soft-output generation are then obtained by generalizing the steps described in
Sections III-B and III-C, respectively. A block diagram highlighting the LORD algorithm steps is reported in
Fig. 1.

A. The preprocessing algorithm - standard formulation 

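The formulation described in the sequel can be viewed as a generalization of the equations
reported in Section III-A for $L_t = 2$. This preprocessing corresponds to GSO with normalizations
deferred to a later stage. To best understand this preprocessing, note that there is a $2L_r \times 2L_t$
orthogonal matrix

$$\mathbf{Q} = [\mathbf{q}_1\ \mathbf{q}_2\ \mathbf{q}_3\ \mathbf{q}_4\ \ldots\ \mathbf{q}_{2L_t-1}\ \mathbf{q}_{2L_t}] \qquad (50)$$

where

$$\mathbf{q}_1 = \mathbf{h}_1, \quad \mathbf{q}_2 = \mathbf{h}_2, \quad \mathbf{q}_3 = \sigma_1^2\mathbf{h}_3 - s_{1,3}\mathbf{h}_1 - s_{2,3}\mathbf{h}_2, \quad \mathbf{q}_4 = \sigma_1^2\mathbf{h}_4 - s_{1,4}\mathbf{h}_1 - s_{2,4}\mathbf{h}_2 \qquad (51)$$

$$\mathbf{q}_5 = r_3\sigma_1^2\mathbf{h}_5 - r_3 s_{1,5}\mathbf{h}_1 - r_3 s_{2,5}\mathbf{h}_2 - t_{3,5}\mathbf{q}_3 - t_{4,5}\mathbf{q}_4 \qquad (52)$$

$$\mathbf{q}_p = P_2^{k}\left[\sigma_1^2\mathbf{h}_p - s_{1,p}\mathbf{h}_1 - s_{2,p}\mathbf{h}_2\right] - \sum_{i=2}^{k-1} P_{i+1}^{k}\left(t_{2i-1,p}\mathbf{q}_{2i-1} + t_{2i,p}\mathbf{q}_{2i}\right) - t_{2k-1,p}\mathbf{q}_{2k-1} - t_{2k,p}\mathbf{q}_{2k}$$

where $p$ denotes the generic $k$-th pair of $\mathbf{q}$ columns, i.e. $p = \{2k+1, 2k+2\}$, with $k \in \{2, \ldots, L_t - 1\}$,
and which uses the following definitions:

$$s_{j,k} = \mathbf{h}_j^T\mathbf{h}_k, \quad t_{j,k} = \mathbf{q}_j^T\mathbf{h}_k, \quad \sigma_k^2 = \|\mathbf{h}_k\|^2, \quad P_m^n = \prod_{j=m}^{n} r_{2j-1} \qquad (53)$$

where $m$, $n$ are integers with $1 < m \le n$. The terms $r_{2k-1}$, with $k = \{1, \ldots, L_t\}$, are given by:

$$r_3 = \sigma_1^2\sigma_3^2 - s_{1,3}^2 - s_{2,3}^2 \qquad (54)$$

$$r_{2k-1} = P_2^{k-1}\left(\sigma_1^2\sigma_{2k-1}^2 - s_{1,2k-1}^2 - s_{2,2k-1}^2\right) - \sum_{i=2}^{k-2} P_{i+1}^{k-1}\left(t_{2i-1,2k-1}^2 + t_{2i,2k-1}^2\right) - t_{2k-3,2k-1}^2 - t_{2k-2,2k-1}^2 \qquad (55)$$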

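They can also be written in the compact form

$$r_{2k-1} = P_2^{k-1}\sigma_1^2\sigma_{2k-1}^2\left(1 - |\rho_{1,k}|^2 - \sum_{i=2}^{k-1}|\rho'_{i,k}|^2\right) \qquad (56)$$

where we have used the square magnitudes of the (generalized) correlation coefficients:

$$|\rho_{k,j}|^2 = \frac{s_{2k-1,2j-1}^2 + s_{2k-1,2j}^2}{\sigma_{2k-1}^2\sigma_{2j-1}^2}, \qquad |\rho'_{k,j}|^2 = \frac{t_{2k-1,2j-1}^2 + t_{2k-1,2j}^2}{\|\mathbf{q}_{2k-1}\|^2\sigma_{2j-1}^2}, \qquad j > k. \qquad (57)$$

It is easily shown that (11) can be generalized as:

$$\|\mathbf{q}_{2k-1}\|^2 = \|\mathbf{q}_{2k}\|^2, \qquad \mathbf{q}_{2k-1}^T\mathbf{h}_{2j-1} = \mathbf{q}_{2k}^T\mathbf{h}_{2j}, \qquad \mathbf{q}_{2k-1}^T\mathbf{h}_{2j} = -\mathbf{q}_{2k}^T\mathbf{h}_{2j-1}, \qquad j > k. \qquad (58)$$

Also, by construction the $\mathbf{q}$ vectors, and $\{\mathbf{q}, \mathbf{h}\}$ couples, are pairwise orthogonal, i.e.

$$\mathbf{q}_{2k-1}^T\mathbf{q}_{2k} = 0, \qquad \mathbf{q}_{2k-1}^T\mathbf{h}_{2k} = 0.$$

The orthogonal matrix $\mathbf{Q}$ then satisfies

$$\mathbf{Q}^T\mathbf{Q} = \mathrm{diag}\left[\sigma_1^2, \sigma_1^2, \|\mathbf{q}_3\|^2, \|\mathbf{q}_3\|^2, \ldots, \|\mathbf{q}_{2L_t-1}\|^2, \|\mathbf{q}_{2L_t-1}\|^2\right]. \qquad (59)$$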
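By defining the $2L_t \times 2L_t$ upper triangular matrix

$$\mathbf{R} = \begin{bmatrix}
1 & 0 & s_{1,3} & s_{1,4} & r_3 s_{1,5} & \cdots & \cdots & \cdots & P_2^{L_t-1}s_{1,2L_t-1} & P_2^{L_t-1}s_{1,2L_t} \\
0 & 1 & -s_{1,4} & s_{1,3} & -r_3 s_{1,6} & \cdots & \cdots & \cdots & -P_2^{L_t-1}s_{1,2L_t} & P_2^{L_t-1}s_{1,2L_t-1} \\
0 & 0 & 1 & 0 & t_{3,5} & \cdots & \cdots & \cdots & P_3^{L_t-1}t_{3,2L_t-1} & P_3^{L_t-1}t_{3,2L_t} \\
0 & 0 & 0 & 1 & t_{4,5} & \cdots & \cdots & \cdots & -P_3^{L_t-1}t_{3,2L_t} & P_3^{L_t-1}t_{3,2L_t-1} \\
\vdots & & & & \ddots & & & & \vdots & \vdots \\
0 & \cdots & & & & \cdots & 1 & 0 & t_{2L_t-3,2L_t-1} & t_{2L_t-3,2L_t} \\
0 & \cdots & & & & \cdots & 0 & 1 & -t_{2L_t-3,2L_t} & t_{2L_t-3,2L_t-1} \\
0 & \cdots & & & & \cdots & 0 & 0 & 1 & 0 \\
0 & \cdots & & & & \cdots & 0 & 0 & 0 & 1
\end{bmatrix} \qquad (60)$$

the real channel matrix $\mathbf{H}_r$ can be decomposed in the product:

$$\mathbf{H}_r = \mathbf{Q}\mathbf{R}\mathbf{\Lambda}_q \qquad (61)$$

where the $2L_t \times 2L_t$ diagonal matrix

$$\mathbf{\Lambda}_q = \mathrm{diag}\left[1, 1, \sigma_1^{-2}, \sigma_1^{-2}, \ldots, (P_2^{L_t-1}\sigma_1^2)^{-1}, (P_2^{L_t-1}\sigma_1^2)^{-1}\right] \qquad (62)$$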



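includes the normalization factors due to the fact that $\mathbf{Q}$ is not orthonormal. Note again that all
values of $\mathbf{Q}$ are simple functions of the known channel coefficients. Again the signal model
after preprocessing is given as

$$\tilde{\mathbf{y}}_r = \mathbf{Q}^T\mathbf{y}_r = \tilde{\mathbf{R}}\mathbf{x}_r + \mathbf{Q}^T\mathbf{n}_r = \tilde{\mathbf{R}}\mathbf{x}_r + \tilde{\mathbf{n}}_r. \qquad (63)$$

The triangular matrix $\tilde{\mathbf{R}} = \mathbf{Q}^T\mathbf{Q}\mathbf{R}\mathbf{\Lambda}_q$ is given by:

$$\tilde{\mathbf{R}} = \begin{bmatrix}
\sigma_1^2 & 0 & s_{1,3} & s_{1,4} & s_{1,5} & \cdots & \cdots & \cdots & s_{1,2L_t-1} & s_{1,2L_t} \\
0 & \sigma_1^2 & -s_{1,4} & s_{1,3} & -s_{1,6} & \cdots & \cdots & \cdots & -s_{1,2L_t} & s_{1,2L_t-1} \\
0 & 0 & r_3 & 0 & t_{3,5} & \cdots & \cdots & \cdots & t_{3,2L_t-1} & t_{3,2L_t} \\
0 & 0 & 0 & r_3 & -t_{3,6} & \cdots & \cdots & \cdots & -t_{3,2L_t} & t_{3,2L_t-1} \\
\vdots & & & & \ddots & & & & \vdots & \vdots \\
0 & \cdots & & & & \cdots & r_{2L_t-3} & 0 & t_{2L_t-3,2L_t-1} & t_{2L_t-3,2L_t} \\
0 & \cdots & & & & \cdots & 0 & r_{2L_t-3} & -t_{2L_t-3,2L_t} & t_{2L_t-3,2L_t-1} \\
0 & \cdots & & & & \cdots & 0 & 0 & r_{2L_t-1} & 0 \\
0 & \cdots & & & & \cdots & 0 & 0 & 0 & r_{2L_t-1}
\end{bmatrix} \qquad (64)$$

The noise vector in the triangular model still has independent components but with unequal
variances given by$^3$:

$$\mathbf{R}_{\tilde{n}} = E[\tilde{\mathbf{n}}_r\tilde{\mathbf{n}}_r^T] = \frac{N_0}{2}\,\mathrm{diag}\left[\sigma_1^2, \sigma_1^2, r_3\sigma_1^2, r_3\sigma_1^2, \ldots, P_2^{L_t}\sigma_1^2, P_2^{L_t}\sigma_1^2\right].$$

The resulting preprocessing complexity expressed in terms of RMs is $O(2L_rL_t^2 + 2L_t^2 + 4L_tL_r + K)$,
where $K = 13$ for $L_t = 4$ and grows proportionally to $L_t^2$ for large $L_t$. More detailed
explanations on complexity are reported in Section IV-D. We note that this result takes into
account that an explicit computation of the matrix $\mathbf{Q}$ is not required, but rather it is possible to
proceed to the direct computation of the scalar products $\mathbf{Q}^T\mathbf{y}_r$. Also, the benefit of deferring the
normalizations will become apparent from Sections IV-C and IV-D.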
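
The following is a minimal pure-Python sketch of a GSO with deferred normalizations in the spirit of (51)-(52). The I/Q column ordering used to build the real channel matrix, the helper names (`real_channel_columns`, `deferred_gso`) and the random test channel are assumptions made for illustration only. The sketch avoids divisions and square roots and then forms $\mathbf{Q}^T\mathbf{Q}$ and $\mathbf{Q}^T\mathbf{H}_r$ so that their structure can be inspected numerically.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def prod(values):
    out = 1.0
    for v in values:
        out *= v
    return out

def real_channel_columns(H):
    """Columns of the real channel matrix for a complex H (list of rows).

    Assumed ordering: for each transmit antenna, the first column stacks
    (Re, Im) of every channel tap and the second stacks (-Im, Re), so the
    two columns of a pair are orthogonal and have equal norm.
    """
    n_rx, n_tx = len(H), len(H[0])
    cols = []
    for j in range(n_tx):
        h_i, h_q = [], []
        for i in range(n_rx):
            h_i += [H[i][j].real, H[i][j].imag]
            h_q += [-H[i][j].imag, H[i][j].real]
        cols += [h_i, h_q]
    return cols

def deferred_gso(cols):
    """Gram-Schmidt without divisions or square roots.

    Returns the unnormalized orthogonal columns q and a list r, where r[k]
    plays the role of r_{2k+1} in the text (r[0] = sigma_1^2).
    """
    q, r = [], []
    for k in range(len(cols) // 2):
        scale = prod(r[:k])                    # 1 for the first pair
        for p in (2 * k, 2 * k + 1):
            new = [scale * v for v in cols[p]]
            for i in range(k):
                weight = prod(r[i + 1:k])      # product of the later r terms
                for j in (2 * i, 2 * i + 1):
                    t = dot(q[j], cols[p])     # s_{j,p} or t_{j,p}
                    new = [a - weight * t * b for a, b in zip(new, q[j])]
            q.append(new)
        r.append(dot(q[2 * k], cols[2 * k]))   # diagonal term of the pair
    return q, r

random.seed(0)
H = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(3)]
     for _ in range(4)]                        # hypothetical 4x3 channel
cols = real_channel_columns(H)
q, r = deferred_gso(cols)
gram = [[dot(a, b) for b in q] for a in q]          # Q^T Q
r_tilde = [[dot(a, h) for h in cols] for a in q]    # Q^T H_r
```

Under these assumptions `gram` is diagonal up to rounding, with entries equal to the running products of `r`, and `r_tilde` is upper triangular with `r[k]` on the diagonal of the k-th pair, which mirrors the structure of (59) and (64).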

B. Lattice search and demodulation 

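Having the matrix $\mathbf{Q}$ allows an observation model like (63) to be derived, and a simplified
demodulation is possible. Using the structure of $\tilde{\mathbf{R}}$ shown in (64), the decision metrics can be
written as:

$$T(\mathbf{x}_r) = \|\tilde{\mathbf{y}}_r - \tilde{\mathbf{R}}\mathbf{x}_r\|^2 = \frac{\left(\tilde{y}_1 - \sigma_1^2x_1 - \sum_{k=3}^{2L_t}s_{1,k}x_k\right)^2}{\sigma_1^2} + \frac{\left(\tilde{y}_2 - \sigma_1^2x_2 - \sum_{k=3}^{2L_t}s_{2,k}x_k\right)^2}{\sigma_1^2} + \frac{\left(\tilde{y}_3 - r_3x_3 - \sum_{k=5}^{2L_t}t_{3,k}x_k\right)^2}{\sigma_1^2r_3} + \cdots + \frac{\left(\tilde{y}_{2L_t-1} - r_{2L_t-1}x_{2L_t-1}\right)^2 + \left(\tilde{y}_{2L_t} - r_{2L_t-1}x_{2L_t}\right)^2}{\sigma_1^2P_2^{L_t}} \qquad (65)$$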

The proposed simplified demodulation consists of considering all $M^2$ values for the I and Q
couple of the lowest-level layer. For each hypothesized value of $x_{2L_t-1}$ and $x_{2L_t}$, here denoted
$\bar{x}_{2L_t-1}$ and $\bar{x}_{2L_t}$, the higher-level layers are decoded through interference nulling and cancelling,
or ZF-DFE. For a given layer ordering these operations are similar to the QR version of the
O-DFE algorithm, except for the important difference of operating in the real domain
through a novel lattice formulation. The estimation of the I and Q of the remaining $L_t - 1$ symbols
is implemented through a slicing operation to the constellation elements of $\Omega_x$ for $x_1, \ldots, x_{2L_t-2}$.

$^3$ It should be noted that in practical implementations the normalizations should be performed just prior to the lattice search
in order to avoid including the different noise variances in the ED computation (65).

By writing:



T(x r ) 







d (x 3 , . . 


■ x 2L t )f 










! ($2 


- o\x 2 


- C 2 (x 3 , 


■ ■ -X2L t )) 2 










, (ya 


- r 3 x 3 


- C 3 (x 5 , . 


■■X 2 L t )f 





+ ■■■ + C 2Lt -l (%2L t -l, %2L t ) (66) 

where 

n {V2L t -\ — r 2Lt-l x 2L t -l) + {V2Lt ~ r 2L t -l x 2L t ) f( - n , 

C 2 L t -i = Lt ^ (67) 

°l r 2 

then the conditionally decoded values of x\, . . . x 2 L t -2 as function of each candidate couple 
(x 2 L t -i, x 2 L t ) are determined recursively as: 

V2U-2 — C 2 L t - 2 {x 2 L t -l-,X 2 L t ) 



X2L t -2 = round 



r 2L t -3 



„ = mundl' ^"^^ ^'^'^ ) ,68) 



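Denoting these $2L_t - 2$ conditional decisions as $\hat{\mathbf{x}}_{1,2L_t-2}(\bar{x}_{2L_t-1}, \bar{x}_{2L_t})$, the resulting sequence
estimate is then determined as:

$$\hat{\mathbf{x}}_r = \left[\hat{\mathbf{x}}_{1,2L_t-2}(\hat{x}_{2L_t-1}, \hat{x}_{2L_t}),\ \hat{x}_{2L_t-1},\ \hat{x}_{2L_t}\right] \qquad (69)$$

where

$$\{\hat{x}_{2L_t-1}, \hat{x}_{2L_t}\} = \arg\min_{\bar{x}_{2L_t-1}, \bar{x}_{2L_t} \in \Omega_x} T\left(\hat{\mathbf{x}}_{1,2L_t-2}(\bar{x}_{2L_t-1}, \bar{x}_{2L_t}), \bar{x}_{2L_t-1}, \bar{x}_{2L_t}\right) \qquad (70)$$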

Recall that each group of two rows of $\tilde{\mathbf{R}}$ in (64) corresponds to a transmit antenna. At the bottom
of the triangularized model the search for the I and Q of the $L_t$-th transmit antenna is broken
into orthogonal dimensions and can be carried out independently. Also, looking at each $k$-th
pair of rows $(2k-1, 2k)$ of (64) it is clear that the corresponding I and Q couple $(x_{2k-1}, x_{2k})$
can be decoded independently once the interference from the lower layers has been cancelled.
These orthogonality relations do not hold for the traditional lattice formulation [16]. Unlike
the case of $L_t = 2$ transmit antennas, however, the generalized low-complexity search is
suboptimal. A lower complexity optimal ML demodulation would still be possible through slicing
$(x_1, x_2)$ over all the possible $M^{2(L_t-1)}$ values of the other elements, but this would still be too
complex for $L_t > 2$. Near-optimal hard-output performance is possible if the layers are
ordered properly in the above described demodulation scheme, as will be highlighted in future
works. Simulation results, not reported in the present paper, confirm this statement. The next
section will show through numerical results that ordering is not essential in order to achieve
near-ML performance in BICM systems.
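
The search described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the triangular model is taken as already normalized (so that all noise components have the same variance), the matrix and observation are hypothetical, and the names `slice_pam` and `lord_hard_decision` are not from the paper.

```python
def slice_pam(value, pam):
    """Nearest PAM level, i.e. the slicing ('round') operation."""
    return min(pam, key=lambda a: abs(a - value))

def lord_hard_decision(y, R, pam):
    """Try all M^2 couples of the bottom layer; decode the rest by ZF-DFE.

    y and R describe a normalized triangular model y = R x + n, with R
    upper triangular (list of rows) and an even number of rows.
    """
    n = len(y)
    best_x, best_metric = None, float("inf")
    for x_i in pam:
        for x_q in pam:
            x = [0.0] * n
            x[n - 2], x[n - 1] = x_i, x_q
            # Interference cancellation and slicing for the upper layers.
            for row in range(n - 3, -1, -1):
                interference = sum(R[row][c] * x[c] for c in range(row + 1, n))
                x[row] = slice_pam((y[row] - interference) / R[row][row], pam)
            metric = sum(
                (y[row] - sum(R[row][c] * x[c] for c in range(row, n))) ** 2
                for row in range(n)
            )
            if metric < best_metric:
                best_x, best_metric = x, metric
    return best_x, best_metric

# Hypothetical noiseless example with 4-PAM per real dimension.
pam = [-3, -1, 1, 3]
R = [[2.0, 0.0, 0.5, -0.3],
     [0.0, 2.0, 0.3, 0.5],
     [0.0, 0.0, 1.5, 0.0],
     [0.0, 0.0, 0.0, 1.5]]
x_true = [1, -3, 3, -1]
y = [sum(R[i][j] * x_true[j] for j in range(4)) for i in range(4)]
x_hat, metric = lord_hard_decision(y, R, pam)
```

The number of visited sequences is $M^2$ regardless of the number of layers, which is the point of the simplified search; with more than two transmit antennas the DFE step makes the result suboptimal in general.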

C. Bit LLR generation 

The proposed idea is to approximate the minimization of the two terms involved in (45) using
the principles exemplified with (65)-(69). Let us consider the bits corresponding to the complex
symbol $X_{L_t}$ in the symbol vector $\mathbf{x}_c = (X_1, \ldots, X_{L_t})^T$. The sequences used to minimize the two
terms of (45) are determined considering all possible $M^2$ values for $X_{L_t}$, while the value for
the other elements $(x_1, \ldots, x_{2L_t-2})$ is derived through the DFE operation (68). Equation (45) can
then be approximated as:

$$L(b_{L_t,k}|\tilde{\mathbf{y}}_r) = \min_{\{\bar{x}_{2L_t-1}, \bar{x}_{2L_t}\} \in S(k)_{L_t}^-} T\left(\hat{\mathbf{x}}_{1,2L_t-2}(\bar{x}_{2L_t-1}, \bar{x}_{2L_t}), \bar{x}_{2L_t-1}, \bar{x}_{2L_t}\right) - \min_{\{\bar{x}_{2L_t-1}, \bar{x}_{2L_t}\} \in S(k)_{L_t}^+} T\left(\hat{\mathbf{x}}_{1,2L_t-2}(\bar{x}_{2L_t-1}, \bar{x}_{2L_t}), \bar{x}_{2L_t-1}, \bar{x}_{2L_t}\right) \qquad (71)$$

where $b_{L_t,k}$ are the bits corresponding to $X_{L_t}$, $k = 0, \ldots, M_c - 1$, and $S(k)_{L_t}^+$ ($S(k)_{L_t}^-$) are the
sets of $2^{M_c-1}$ complex symbols having $b_{L_t,k} = 1$ ($b_{L_t,k} = 0$).

In order to compute the approximated max-log LLRs also for the bits corresponding to the
other $L_t - 1$ symbols in $\mathbf{x}_c$, the algorithm has to repeat the steps formerly described for different
layer orderings, where in turn each layer becomes the reference one only once. In other words,
we need models where the last two rows of the triangular matrix (64) correspond, in turn, to
every symbol in $\mathbf{x}_c$. This can be accomplished starting from the natural integer order sequence
$\mathbf{x}_c$ and generating the other $L_t - 1$ permutations recursively by exchanging the last layer with
all the others; then, the columns of the real channel matrix $\mathbf{H}_r$ have to be permuted accordingly,
prior to performing the GSO.

Some considerations on the resulting preprocessing complexity are in order here. The overall
complexity can be estimated recalling that by applying the GSO, the QRD computes the matrix
$\mathbf{R}$ row by row from top to bottom and the matrix $\mathbf{Q}$ columnwise from left to right, as is clear
from (51) and (54). This would suggest that in order to minimize the complexity the considered
permutations should differ in the smallest possible number of indexes. In this case many operations
would not have to be recomputed for different symbol orderings. In any case, the core of the
processing, consisting of the scalar products between $2L_r$-element vectors, can be computed only
once, thus keeping an overall cubic complexity in the number of antennas. This is a consequence
of the absence of normalizations in the GSO computation, as better detailed in Section IV-D.

For the sake of argument, let us consider the following set of index permutations of the
complex symbol sequence $\mathbf{x}_c$. Let $\pi_{L_t}$ be the natural integer order index set, where the reference
layer is the $L_t$-th. Then, a possible efficient set for a recursive APP computation is:

$$\begin{aligned}
\pi_{L_t} &= 1, 2, \ldots, L_t \\
\pi_{L_t-1} &= 1, \ldots, L_t, L_t-1 \\
\pi_{L_t-2} &= 1, \ldots, L_t-1, L_t, L_t-2 \\
&\ \ \vdots \\
\pi_1 &= 2, 3, \ldots, L_t, 1
\end{aligned} \qquad (72)$$

Let $\mathbf{\Pi}_j$ denote a $2L_t \times 2L_t$ permutation matrix that arranges the columns of $\mathbf{H}_r$ according
to the index set $\pi_j$. Then the GSO yields:

$$\mathbf{H}_r\mathbf{\Pi}_j = \mathbf{Q}^{(j)}\mathbf{R}^{(j)}\mathbf{\Lambda}_q^{(j)} \qquad (73)$$

and the matrix $\tilde{\mathbf{R}}^{(j)}$ can be computed as $\tilde{\mathbf{R}}^{(j)} = \mathbf{Q}^{(j)T}\mathbf{Q}^{(j)}\mathbf{R}^{(j)}\mathbf{\Lambda}_q^{(j)}$. Finally, we can write:

$$\tilde{\mathbf{y}}_r^{(j)} = \mathbf{Q}^{(j)T}\mathbf{y}_r = \tilde{\mathbf{R}}^{(j)}\mathbf{x}_r^{(j)} + \mathbf{Q}^{(j)T}\mathbf{n}_r \qquad (74)$$

where $\mathbf{x}_r^{(j)}$ is the permuted I and Q sequence. Indicating the corresponding ED metrics as
$T^{(j)}(\mathbf{x}_r^{(j)})$, the LLR of the bits corresponding to the $j$-th symbol can be written as:

$$L(b_{j,k}|\tilde{\mathbf{y}}_r^{(j)}) = \min_{\{\bar{x}_{2j-1}, \bar{x}_{2j}\} \in S(k)_j^-} T^{(j)}\left(\hat{\mathbf{x}}^{(j)}(\bar{x}_{2j-1}, \bar{x}_{2j}), \bar{x}_{2j-1}, \bar{x}_{2j}\right) - \min_{\{\bar{x}_{2j-1}, \bar{x}_{2j}\} \in S(k)_j^+} T^{(j)}\left(\hat{\mathbf{x}}^{(j)}(\bar{x}_{2j-1}, \bar{x}_{2j}), \bar{x}_{2j-1}, \bar{x}_{2j}\right) \qquad (75)$$

where $b_{j,k}$ are the bits corresponding to $X_j$, $k = 0, \ldots, M_c - 1$, $S(k)_j^+$ ($S(k)_j^-$) are the sets
of $2^{M_c-1}$ bit sequences having $b_{j,k} = 1$ ($b_{j,k} = 0$), and $\hat{\mathbf{x}}^{(j)}(\bar{x}_{2j-1}, \bar{x}_{2j})$ denotes the $2L_t - 2$
conditional decisions of the layer order sequence $\pi_j$ in (72), in analogy to (68).

It is apparent that LORD is an approximated method for bit LLR generation relying on a
lattice search of $L_tM^2$ symbol sequences, as opposed to a search of $M^{2L_t}$ as required by the
maximum a-posteriori probability (MAP) demodulator. A further practical advantage of LORD
is that the LLR computation for the bits corresponding to the $L_t$ symbols can be carried out in
a parallel fashion.
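
As a small illustration of the layer orderings in (72) and of the size of the search, the following Python sketch builds the index sets and compares the number of sequences visited by LORD with that of an exhaustive search. The function names are hypothetical, and the mapping from layers to columns of $\mathbf{H}_r$ assumes that the I and Q columns of each antenna are adjacent.

```python
def layer_orderings(n_tx):
    """Index sets pi_j of (72): natural order with layer j moved to the end."""
    return {j: [i for i in range(1, n_tx + 1) if i != j] + [j]
            for j in range(1, n_tx + 1)}

def real_column_order(order):
    """0-based columns of H_r for a layer order, keeping each I/Q pair together."""
    cols = []
    for layer in order:
        cols += [2 * layer - 2, 2 * layer - 1]
    return cols

def searched_sequences(n_tx, m_pam):
    """Sequences visited by LORD (L_t * M^2) and by exhaustive search (M^(2 L_t))."""
    return n_tx * m_pam ** 2, m_pam ** (2 * n_tx)

orderings = layer_orderings(4)
print(orderings[1], real_column_order(orderings[1]))
print(searched_sequences(4, 8))   # 4 transmit antennas, 64QAM (M = 8)
```

Each ordering can be processed independently of the others, which is what makes a parallel implementation of the LLR computation possible.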

D. Complexity estimation 

The aim of this section is to clarify the complexity estimates previously reported, focusing 
on the general case of a single-carrier MIMO system with L t transmit and L r receive antennas 
and known CSI at the receiver. The estimates are expressed in terms of RMs. For static or 
slowly-varying channel applications, like WLANs, it is also important to distinguish between 
channel-dependent and receiver observation-dependent terms, because in this case CSI can be 
computed once per frame (or packet) differently from the observation-related terms. 
• Channel dependent terms. 

They are the entries of the matrix $\tilde{\mathbf{R}}$ in (64). A significant observation is that the number of
nonzero real entries to be computed is $L_t^2$, instead of $2L_t^2 + L_t$. This is a consequence of the
adopted I and Q ordering and particularly of (11), (58).

Each of the $L_t^2$ entries involves the computation of the scalar product of a $2L_r$-element
vector, for a resulting complexity $O(2L_rL_t^2)$. Specifically, they are $\sigma_{2k-1}^2 = \|\mathbf{h}_{2k-1}\|^2$
with $k = 1, \ldots, L_t$, and the terms $s_{i,j} = \mathbf{h}_i^T\mathbf{h}_j$, with $i < j$ and $j = 1, \ldots, 2L_t$; it should
be noticed that also $t_{i,j} = \mathbf{q}_i^T\mathbf{h}_j$ ultimately depend on $s_{i,j}$, as is clear from (51).
The computation of the terms $t_{i,j}$ grows quadratically with $L_t$ for $L_t \ge 4$, when $L_t - 2$
couples of columns of $\tilde{\mathbf{R}}$ including those terms are present. When $L_t = 2$ no such terms
exist, while there are only terms $t_{i,j}$ involving $\mathbf{q}_3$, $\mathbf{q}_4$ if $L_t = 3$, i.e. the $t_{i,j}$ do not depend
recursively on themselves, as is evident from (51). The complexity associated with these
computations is then

K = 6, L t = 3 (76) 

K = —L 2 - —L t + 87, L t > 4 
2 t 2 

but is anyway limited for practical $L_t$, e.g. $K = 13$ with $L_t = 4$.
The computation of the $L_t - 1$ diagonal terms $r_{2k-1}$, with $k = 2, \ldots, L_t$, requires
$2L_t^2 - 4L_t + 3$ RMs.

Overall, the resulting complexity associated with the computation of the matrix $\tilde{\mathbf{R}}$ can be
estimated as $O(2L_rL_t^2 + 2L_t^2 - 4L_t + 3 + K)$.
• Observation dependent terms. 

It should be noted that the explicit computation of the orthogonal matrix $\mathbf{Q}$ is not required, but
rather it is possible to proceed to a direct computation of the elements of the vector $\tilde{\mathbf{y}}_r = \mathbf{Q}^T\mathbf{y}_r$.
In fact the scalar products $\mathbf{q}_j^T\mathbf{y}_r$ ultimately depend on a linear combination of the scalar products
$V_k = \mathbf{h}_k^T\mathbf{y}_r$, with $k = 1, \ldots, 2L_t$, whose total complexity is $4L_tL_r$ RMs. The resulting additional
complexity due to the linear combinations can be estimated as:

W = 6, L t = 2 (77) 

W = 16, L t = 3 

W = UL t - 26, L t > 4 

The total complexity can then be estimated as $O(4L_tL_r + W)$. It should be observed that the
complexity of the observation-dependent terms is quadratic in the size of the system, as
opposed to the cubic dependence of the channel-related terms, but for static or slowly-varying
channels the involved operations must be updated more frequently than those related to the
channel.

The processing complexity derived so far does not take into account the extra complexity
arising from computing some of the coefficients of the matrix $\tilde{\mathbf{R}}$ and the elements of $\tilde{\mathbf{y}}_r$ $L_t$
times (cf. Section IV-C). A precise complexity estimation would depend upon the specific
adopted permutation set, of which (72) is an example. Here we just point out that even in
a pessimistic scenario where no reuse of the formerly executed computations were possible,
the resulting complexity would be given by $L_t$ times $K$ (76), $W$ (77), and the number of
multipliers associated with the elements $r_{2k-1}$. The order of magnitude of the complexity would thus
still remain cubic in the dimension of the MIMO system. It should be stressed that this is a
consequence of not having the normalizations in the GSO. Thanks to this variation, the scalar
products between $2L_r$-element vectors, which represent the main contribution to the
preprocessing complexity, ultimately involve non-normalized channel columns as in $\mathbf{h}_j^T\mathbf{y}_r$ or $\mathbf{h}_i^T\mathbf{h}_j$. This
means they can be reused in computing the GSO for any layer ordering.
• Complexity of the lattice search. 

The complexity associated with the demapping and bit LLR calculation has a crucial role for
hardware implementations of the algorithm, as the related operations need to be updated for
every channel observation and are proportional to both $L_t$ and $S = M^2$, the size of the complex
constellation. A high-level estimation can be carried out recalling that the computation of the bit
LLRs corresponding to the $j$-th symbol (75) requires $M^2$ squared norms of $2L_t$-element vectors.
Thus, in first approximation the complexity for the whole transmit sequence is $O(2L_t^2M^2)$ RMs.
This estimate is correct under the assumption that in (65) the number of products mainly derives
from $2L_t$ squares. That can be justified as integer $M$-PAM values $x_k$ are to be spanned; thus
products like $cx_k$, where $c$ is a constant value, can be handled as sums like

$$cx_k = cx_0 + 2kc,$$

where $x_0 = -(M - 1)$, $k = 0, 1, \ldots, (M - 1)$, provided that the intermediate product terms $cx_0$ are
stored.
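
The recurrence above can be sketched as follows. The function name is hypothetical; the sketch simply shows that, once $cx_0$ is stored, all products $cx_k$ over the $M$-PAM alphabet follow by additions only.

```python
def pam_products(c, m_pam):
    """Return [c * x_k] for x_k = -(M - 1) + 2k, using one product and additions."""
    value = c * (-(m_pam - 1))   # stored product c * x_0
    step = c + c                 # 2c, obtained with an addition
    out = []
    for _ in range(m_pam):
        out.append(value)
        value += step
    return out

print(pam_products(0.5, 4))   # c * [-3, -1, 1, 3]
```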

This complexity estimate could be further reduced if implementation optimizations already
proposed for SD [32] are adopted for LORD too. Among others, it is worth mentioning the
possibility of "tree pruning", i.e. during the computation of the PED terms in (65) it is possible to take into
account a threshold derived from former EDs and stop the computation at any layer if the sum
of the already computed PEDs is higher than such a threshold. Besides, it should be noted that
possible simplifications to the vector norm computation (e.g. through $\ell^1$ or $\ell^\infty$ norms) may be
applied to LORD as well. However, their impact on LLR accuracy should be carefully evaluated
first.
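
A minimal sketch of the early-termination idea is given below, for the hard-decision metric only. It assumes that the per-layer PED terms of each candidate are available as a list, and uses the best ED found so far as the threshold; the names `pruned_metric` and `best_candidate` and the numerical values are hypothetical.

```python
def pruned_metric(ped_terms, threshold):
    """Accumulate PED terms layer by layer; return None once above the threshold."""
    total = 0.0
    for term in ped_terms:
        total += term
        if total > threshold:
            return None          # remaining layers need not be computed
    return total

def best_candidate(candidates):
    """candidates: iterable of (label, per-layer PED terms)."""
    best_label, best = None, float("inf")
    for label, terms in candidates:
        metric = pruned_metric(terms, best)
        if metric is not None and metric < best:
            best_label, best = label, metric
    return best_label, best

candidates = [("a", [1.0, 2.0, 0.5]), ("b", [4.0, 0.1, 0.1]), ("c", [0.2, 0.3, 0.1])]
label, metric = best_candidate(candidates)
```

Since PED terms are non-negative, stopping a candidate whose partial sum already exceeds the current best ED does not change the selected sequence; for soft-output generation the choice of threshold would need separate care.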

V. Simulation results 

In this section the performance of LORD is reported in two main MIMO-OFDM configurations
of interest: BICM, which is the main scheme considered by next generation wireless standards
(Fig. 2); STC mapping without concatenated ECC (Fig. 3).

Several detection algorithms have been simulated in the MIMO-OFDM BICM scheme. In a subset
of cases, the performance of exhaustive-search ML detection was also verified, including $L_t = 3$
and 16QAM, corresponding to 4096 operations per complex symbol. It should be noted that
throughout this section ML means "exhaustive search over the constellation symbols".

The block diagram of the system is depicted in Fig. 2. The system specifications are described
in [33] and represent one of the proposals for the ongoing standardization activity of IEEE
802.11n next generation WLANs. In particular, the OFDM parameters are: 54 data tones out
of a total of 64 tones; 20 MHz bandwidth; 3.2 μs IFFT/FFT period and 0.8 μs guard interval
duration. The basic ECC scheme we considered is a convolutional code (CC) cascaded with a bit
interleaver. The CC decoder is either a soft-input Viterbi algorithm (VA), or optionally a soft-in
soft-out VA (SOVA, [34]) for use in turbo iterative combined decoder and detection schemes. CC
performance is also compared with an advanced ECC option, i.e. a low density parity check code
(LDPCC); no interleaver was used in this case. The considered LDPCC matrices are specified
in [33] (1944-bit coded block size, code rates 1/2, 2/3, 3/4, 5/6); a theoretical description of
their structure can be found in [35]. LDPCC simulations refer to 12 iterations of a log-domain
version of the sum-product algorithm.

In order to verify the performance in different channel conditions of practical interest, two 
different frequency selective channel models [36] were considered: channel B, characterized by 
a 9-tap tapped delay line profile with 15 ns root mean square (rms) delay spread; channel D, 
18-tap and 50 ns rms delay spread. Channel B is a useful benchmark for scenarios with limited 
frequency selectivity, like home residential environment, while Channel D has a significant 
frequency diversity as typical of indoor office. 

The performance has been simulated in terms of packet error rate (PER) versus SNR, for a
1000-byte WLAN packet length; in the following, SNR gain will be related to $10^{-2}$ PER unless
otherwise stated. The MIMO detector in Fig. 2 operates at subcarrier level, assuming known
channel state information (CSI) and ideal synchronization. The following soft-output algorithms
have been considered: LORD with max-log bit LLR computation; MMSE with max-log bit LLR
computed taking into account the Gaussian approximation [8]; iterative MMSE and soft IC (SIC)
as in [7], [8], with SOVA (optional feedback path in Fig. 2); exhaustive-search ML with optimal
bit LLR computed through the Jacobian logarithm (or "max*" function) [30]. MMSE-SIC plots
refer to four stages of MMSE processing (i.e. three iterations), as no appreciable performance
improvement can be observed for more loops.

Fig. 4 reports the performance of LORD versus MMSE for channel models B and D, CC
coded system, and $L_t = 2$, $L_r = 2$ (in short, 2x2) MIMO system. Fig. 4(a) and 4(b) refer to
16QAM modulation code rate (CR) 1/2 and 64QAM CR 5/6 respectively. A significant SNR gain
over MMSE is visible in all cases, from a minimum of 2.2 dB for 16QAM modulation CR 1/2
and channel D, to a maximum of 7.4 dB for 64QAM CR 5/6 and channel B; also, comparisons 
with ML confirm the optimality of LORD with L t = 2. The small performance degradation of 
LORD has to be attributed to the log-MAP LLR computation used for ML as opposed to the 
max-log used for LORD. Interestingly, LORD and MMSE-SIC show comparable performance 
in case of channel D, 16QAM CR 1/2 while even a single stage of LORD gains 0.6 dB of SNR 
over MMSE-SIC in case of 64QAM CR 5/6. However LORD gains more than 2 dB compared 
to MMSE-SIC in case of channel B, 16 QAM CR 1/2 and the advantage increases to about 
4.5 dB with 64QAM CR 5/6. These performance results offer several lines of interpretation. In 
terms of CR, the gain of LORD versus a linear suboptimal detector like MMSE increases for 
higher CRs. The advantage of LORD is also significantly higher when less frequency selectivity 
is made available by the system, as clear comparing performance obtained with channel models 
B and D. In particular, if limited frequency diversity exists as with channel B, MMSE-SIC does 
not show an appreciable BER curve slope improvement compared to a single stage of MMSE, 
which is the reason why LORD shows a higher gain in this condition. 

The performance of CC coded 64QAM, CR 5/6 is shown in Fig. 5 in case of a 2x3 MIMO
system. It can be noted that the general trend visible in Fig. 4(b) still holds if additional
spatial diversity is made available by the system, even though the relative gain of LORD versus
MMSE decreases; nevertheless, an SNR gain higher than 3 dB is observable for 64QAM CR 5/6
and channel model B.

Figs. 6 and 7 show that the advantage of LORD versus MMSE increases for MIMO systems with
a higher number of transmit antennas (at least up to $L_t = 4$). Fig. 6 reports the performance of
3x3 16QAM modulation, CR 1/2 and 3/4 respectively, for ML, LORD and MMSE detectors.
Results confirm that LORD is suboptimal if more than two transmit antennas are used, but the
gap from ML is contained within 2 dB for CR 1/2, and is only 1.2 dB for CR 3/4; in this last
case the gain over MMSE is about 7.2 dB.

Fig. 7 summarizes the performance of LORD and MMSE in case of a 4x4 MIMO system,
64QAM modulation, CC coded system with CR 5/6, channel models B and D; also, two plots
with LDPCC and channel D are provided for comparison. The gain of LORD over MMSE is
8.9 dB and 14.8 dB with channel model D and B respectively. Also, LORD shows > 3 dB of
SNR gain over MMSE-SIC with channel D, while this gain increases to > 9 dB if channel B
is modelled. Thus, LORD makes it possible to avoid iterative MMSE detectors, which are characterized
by latency and complexity disadvantages. LORD iterative schemes are an envisioned topic for future work.

We note that even though SNR levels much higher than 30 dB are probably difficult to achieve
in practical 802.11n systems, these results demonstrate the importance of an ML-approaching
MIMO equalizer like LORD in order to implement the highest data rate transmission schemes
currently under definition by the 802.11n standardization committee (the system of Fig. 7 corresponds
to a data rate of 270 Mbits/s [33]). This is particularly evident for channel model B, where MMSE
has a dramatic performance degradation. It should also be observed that an advanced ECC such as
LDPCC is able to provide a gain over CC in the order of 2 dB if used with MMSE, and of
1.2 dB with LORD. The preliminary conclusion that can be drawn is that an advanced ECC
in combination with a linear detector is not enough to recover the performance degradation
of the system when limited frequency diversity is present. In this case, iterative detection
techniques also do not prove to be effective if a detector unable to take advantage of receive diversity,
such as MMSE, is used as a first stage.

Another case of interest, as previously mentioned, is represented by the algebraic STCs
(ASTCs) [37], [28] in MIMO-OFDM schemes. The interest in this class of codes is motivated
by their ability to yield full data rate (i.e. they transmit 2 symbols per channel use as $L_t = 2$)
and a maximal diversity order $2L_r$ at the same time. In particular, the Golden Codes (GCs) [28]
outperform all the other classes of STCs proposed so far, to the authors' knowledge. However,
ASTCs would require ML detection in order to provide full diversity order. In our simulations
GCs were decoded using hard-output SD, according to the MIMO-OFDM block diagram shown
in Fig. 3. The OFDM specifications were the same as those of the BICM system. MIMO-OFDM BICM
LORD-detected systems outperform uncoded GCs for the same bits per channel use (bpcu), under
channel conditions characterized by some degree of frequency selectivity like channel models
B and D [36], as evidenced in Fig. 9 and Fig. 10 respectively for 2x2 and 8 bpcu. It should be
noted that these results were obtained with a Viterbi-decoded 64-state CC, i.e. no powerful ECC
was necessary. Only in an i.i.d. flat fading channel (model A) does the space-time coded system show
better performance than the BICM system (Fig. 8). Two main considerations can be drawn from
these results. On one hand, they confirm the importance of a low-complexity soft-output near-ML
detector like LORD in order to fully exploit the space-frequency diversity embedded in layered
BICM systems as specified by practical applications like 802.11n. On the other hand, the plots
also show that for short block length codes like the GCs, a hard-output ML decoder like SD is
not sufficient to make them attractive for next generation wireless systems; this would still be
true even if an optional front-end to accelerate the decoder convergence were used, as proposed in
[18]. We then infer that low-complexity soft-output near-ML detectors and a properly designed
BICM scheme would be needed also for GCs in MIMO-OFDM schemes. No such detectors
have been proposed so far for full diversity full data rate ASTCs; in [38] BICM TAST was
decoded through a low complexity message passing iterative decoder. The adaptation of LORD
to optimally decode such codes is considered a topic for future research. Advanced ECCs do
not seem to be essential if enough frequency selectivity is present in the system.



In this paper a novel MIMO detection algorithm was proposed. LORD belongs to the class 
of lattice detector algorithms, though it uses a novel lattice formulation. A low-complexity 
channel preprocessing algorithm was described, alternative to the standard QR decomposition. 
The symbol sequence estimation is then performed through a parallelizable, reduced size and 
deterministic lattice search, also suitable for generation of reliable max-log bit LLRs. LORD was 
shown to be (max-log) optimal for two transmit antennas and near-optimal for three and four 
transmit sources in MIMO-OFDM BICM configuration, achieving higher SNR gain than linear 
and iterative nonlinear detectors, thanks to its very good exploitation of receive diversity. Also, 
the performance comparison with full diversity order two-transmit antenna uncoded STCs like 
the GCs showed that LORD detected layered BICM systems perform better even in presence 
of simple error correction codes like a Viterbi decoded convolutional code, provided that some 
degree of frequency selectivity characterizes the channel. 



The formulation for the frontend processing that was presented in Section IIV-AI can be given 
in an alternative equivalent recursive formulation. Recall the observations of interest are 



VI. Conclusions 



Appendix 



Recursive formulation of the preprocessing 




(A.78) 
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The orthogonal matrix can be obtained by defining the following quantities: 

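$$\mathbf{e}_I(0,j) = \mathbf{h}_{2j-1}, \qquad \mathbf{e}_Q(0,j) = \mathbf{h}_{2j}, \qquad \sigma(0,j) = \|\mathbf{h}_{2j-1}\|^2, \qquad 1 \le j \le L_t \tag{A.79}$$

with the following three recursions ($1 \le i < j \le L_t$) 

$$\mathbf{e}_I(i,j) = \sigma(i-1,i)\,\mathbf{e}_I(i-1,j) - r_I(i,j)\,\mathbf{e}_I(i-1,i) - r_Q(i,j)\,\mathbf{e}_Q(i-1,i) \tag{A.80}$$
$$\mathbf{e}_Q(i,j) = \sigma(i-1,i)\,\mathbf{e}_Q(i-1,j) + r_Q(i,j)\,\mathbf{e}_I(i-1,i) - r_I(i,j)\,\mathbf{e}_Q(i-1,i) \tag{A.81}$$
$$\sigma(i,j) = \sigma(i-1,i)\,\sigma(i-1,j) - \left(r_I(i,j)\right)^2 - \left(r_Q(i,j)\right)^2 \tag{A.82}$$

where 

$$r_I(i,j) = \mathbf{e}_I(i-1,i)^T\,\mathbf{e}_I(0,j), \qquad r_Q(i,j) = \mathbf{e}_Q(i-1,i)^T\,\mathbf{e}_I(0,j). \tag{A.83}$$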
With these definitions in place, the columns of the orthogonal matrix are given by the vectors 

$$\mathbf{q}_{2i-1} = \mathbf{e}_I(i-1,i), \qquad \mathbf{q}_{2i} = \mathbf{e}_Q(i-1,i), \qquad 1 \le i \le L_t. \tag{A.84}$$
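
The vector recursions (A.79)-(A.84) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper `real_channel`, the function name `orthogonal_columns`, and the interleaved real/imaginary ordering used to build the real-domain columns are hypothetical choices made for the example. The recursion itself only relies on each pair $\mathbf{h}_{2j-1}, \mathbf{h}_{2j}$ being orthogonal with equal norm, one vector being a 90-degree rotation of the other.

```python
import numpy as np


def real_channel(H):
    """Real-domain channel (assumed ordering): rows and columns interleave Re/Im.

    Column 2k-1 stacks (Re, Im) of the k-th complex column; column 2k is the
    same vector rotated by 90 degrees, so each pair is orthogonal with equal norm.
    """
    L_r, L_t = H.shape
    Hr = np.zeros((2 * L_r, 2 * L_t))
    Hr[0::2, 0::2] = H.real
    Hr[1::2, 0::2] = H.imag
    Hr[0::2, 1::2] = -H.imag
    Hr[1::2, 1::2] = H.real
    return Hr


def orthogonal_columns(Hr):
    """Run (A.79)-(A.84); return Q (unnormalized columns) and sigma(i-1, i)."""
    L_t = Hr.shape[1] // 2
    # Level-0 quantities, (A.79). Python index j corresponds to paper index j+1.
    e_I0 = [Hr[:, 2 * j].copy() for j in range(L_t)]
    e_I = [v.copy() for v in e_I0]
    e_Q = [Hr[:, 2 * j + 1].copy() for j in range(L_t)]
    sigma = [float(v @ v) for v in e_I0]

    Q = np.zeros_like(Hr)
    for i in range(L_t):
        # e_I[i], e_Q[i] now hold e_I(i-1, i), e_Q(i-1, i) in paper indexing: (A.84).
        Q[:, 2 * i] = e_I[i]
        Q[:, 2 * i + 1] = e_Q[i]
        for j in range(i + 1, L_t):
            r_I = e_I[i] @ e_I0[j]          # (A.83)
            r_Q = e_Q[i] @ e_I0[j]          # (A.83)
            new_I = sigma[i] * e_I[j] - r_I * e_I[i] - r_Q * e_Q[i]   # (A.80)
            new_Q = sigma[i] * e_Q[j] + r_Q * e_I[i] - r_I * e_Q[i]   # (A.81)
            sigma[j] = sigma[i] * sigma[j] - r_I**2 - r_Q**2          # (A.82)
            e_I[j], e_Q[j] = new_I, new_Q
    return Q, np.array(sigma)
```

For a random channel, normalizing the columns of the returned `Q` and checking that their Gram matrix is close to the identity is a quick sanity check of the recursion. Because the columns are left unnormalized, their magnitudes grow with the layer index, so a fixed-point implementation would need to account for that dynamic range.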

Computing the pair of orthogonal vectors $\mathbf{q}_{2i-1}$, $\mathbf{q}_{2i}$ would require $i-1$ recursions of (A.80) and 
(A.81), each one involving $2L_r$ terms. However, from (A.78) and as observed in Section IV-D, 
the matrix $\mathbf{Q}$ does not actually have to be computed to accomplish detection. The vector 
recursions specified above matter only because they produce terms that appear in scalar products 
which, in turn, can be expressed as linear combinations of the initialization scalar products 

$$r_{I0}(i,j) = \mathbf{e}_I(0,i)^T\,\mathbf{e}_I(0,j), \qquad r_{Q0}(i,j) = \mathbf{e}_Q(0,i)^T\,\mathbf{e}_I(0,j), \qquad 1 \le i < j \le L_t. \tag{A.85}$$

The upper triangular matrix that results from the decomposition is also simply specified. The 
diagonal elements of $\mathbf{R}$ have the form: 

$$R_{2i-1,2i-1} = \sigma(i-1,i) = R_{2i,2i}. \tag{A.86}$$

The upper triangular elements are: 

$$R_{2i-1,2j-1} = r_I(i,j), \qquad R_{2i-1,2j} = r_Q(i,j), \qquad 1 \le i < j \le L_t \tag{A.87}$$

and 

$$R_{2i,2j-1} = -r_Q(i,j), \qquad R_{2i,2j} = r_I(i,j), \qquad 1 \le i < j \le L_t. \tag{A.88}$$

The noise, $\tilde{\mathbf{n}}_r$, remains white and has a component-wise variance given by 

$$R_{\tilde{n}_{2i-1}} = R_{\tilde{n}_{2i}} = \frac{N_0}{2} \prod_{k=1}^{i} \sigma(k-1,k). \tag{A.89}$$

These recursions give the components of the upper triangular model that are needed for the 
detection algorithm. 

The post-processed observations are also specified with a recursion, given as 

$$y_I(i,j) = \sigma(i-1,i)\,y_I(i-1,j) - r_I(i,j)\,y_I(i-1,i) - r_Q(i,j)\,y_Q(i-1,i) \tag{A.90}$$
$$y_Q(i,j) = \sigma(i-1,i)\,y_Q(i-1,j) + r_Q(i,j)\,y_I(i-1,i) - r_I(i,j)\,y_Q(i-1,i) \tag{A.91}$$

with the following initial conditions 

$$y_I(0,j) = \mathbf{e}_I(0,j)^T\,\mathbf{y}_r, \qquad y_Q(0,j) = \mathbf{e}_Q(0,j)^T\,\mathbf{y}_r. \tag{A.92}$$

The final outputs for the detection will be 

$$[\tilde{\mathbf{y}}_r]_{2i-1} = y_I(i-1,i), \qquad [\tilde{\mathbf{y}}_r]_{2i} = y_Q(i-1,i), \qquad 1 \le i \le L_t. \tag{A.93}$$

Examining the recursions needed for the upper triangular model parameters and the observations, 
it is apparent that the orthogonal vector pairs, $\mathbf{q}_{2i-1}$ and $\mathbf{q}_{2i}$, only have to be computed up 
to the $i \le L_t - 1$ level. 

A couple of examples help illustrate the proposed recursions. First consider again the $L_t = 2$ 
case, for which the preprocessing algorithm has the following pseudo code 

1) Layer 1 

a) Compute $\sigma(0,1)$ - Complexity $O(2L_r)$, 

b) Compute $y_I(0,1)$ - Complexity $O(2L_r)$, 

c) Compute $y_Q(0,1)$ - Complexity $O(2L_r)$, 

2) Layer 2 

a) Compute $\sigma(0,2)$ - Complexity $O(2L_r)$, 

b) Compute $y_I(0,2)$ - Complexity $O(2L_r)$, 

c) Compute $y_Q(0,2)$ - Complexity $O(2L_r)$, 

d) Compute $r_I(1,2)$ - Complexity $O(2L_r)$, 

e) Compute $r_Q(1,2)$ - Complexity $O(2L_r)$, 

f) Compute $\sigma(1,2)$ - Complexity $O(3)$, 

g) Compute $y_I(1,2)$ - Complexity $O(3)$, 

h) Compute $y_Q(1,2)$ - Complexity $O(3)$. 
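
For $L_t = 2$ the steps listed above reduce to a handful of scalar products followed by three scalar updates. The sketch below follows them one by one. It is an illustration under assumptions rather than a reference implementation: the function names `real_channel` and `preprocess_two_tx`, the variable names, and the interleaved real/imaginary ordering of the real-domain channel are hypothetical and chosen only for the example.

```python
import numpy as np


def real_channel(H):
    """Real-domain channel (assumed ordering): rows and columns interleave Re/Im."""
    L_r, L_t = H.shape
    Hr = np.zeros((2 * L_r, 2 * L_t))
    Hr[0::2, 0::2] = H.real
    Hr[1::2, 0::2] = H.imag
    Hr[0::2, 1::2] = -H.imag
    Hr[1::2, 1::2] = H.real
    return Hr


def preprocess_two_tx(Hr, y):
    """Scalar preprocessing for L_t = 2; Hr is (2*L_r, 4), y has length 2*L_r."""
    h1, h2, h3, h4 = Hr.T

    # Layer 1: three scalar products of length 2*L_r.
    s01 = h1 @ h1            # sigma(0,1)
    yI01 = h1 @ y            # y_I(0,1)
    yQ01 = h2 @ y            # y_Q(0,1)

    # Layer 2: five scalar products of length 2*L_r ...
    s02 = h3 @ h3            # sigma(0,2)
    yI02 = h3 @ y            # y_I(0,2)
    yQ02 = h4 @ y            # y_Q(0,2)
    rI = h1 @ h3             # r_I(1,2), from (A.83)
    rQ = h2 @ h3             # r_Q(1,2), from (A.83)

    # ... followed by three scalar updates, (A.82), (A.90), (A.91).
    s12 = s01 * s02 - rI**2 - rQ**2
    yI12 = s01 * yI02 - rI * yI01 - rQ * yQ01
    yQ12 = s01 * yQ02 + rQ * yI01 - rI * yQ01

    y_tilde = np.array([yI01, yQ01, yI12, yQ12])   # outputs as in (A.93)
    return y_tilde, (s01, s12), (rI, rQ)
```

No orthogonal vector is formed explicitly here: for two transmit antennas the first pair of columns is the channel itself, and the second pair enters only through scalar products. The outputs can be cross-checked against $\mathbf{Q}^T\mathbf{y}_r$ by building $\mathbf{q}_3$ and $\mathbf{q}_4$ from (A.80) and (A.81).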

Again, as noted above, the algorithm for $L_t = 2$ has complexity $O(16L_r + 9)$. For the $L_t = 4$ 
case the preprocessing algorithm has the following pseudo code 

1) Layer 1 - Same as $L_t = 2$ case, 

2) Layer 2 - Same as $L_t = 2$ case, 

3) Layer 3 

a) Initialization $\sigma(0,3)$, $y_I(0,3)$, $y_Q(0,3)$ - Complexity $O(6L_r)$, 

b) First recursion $r_I(1,3)$, $r_Q(1,3)$, $\sigma(1,3)$, $y_I(1,3)$, $y_Q(1,3)$ - Complexity $O(4L_r + 9)$, 

c) Second recursion $r_I(2,3)$, $r_Q(2,3)$, $\sigma(2,3)$, $y_I(2,3)$, $y_Q(2,3)$ - Complexity $O(4L_r + 15)$, 

4) Layer 4 

a) Initialization $\sigma(0,4)$, $y_I(0,4)$, $y_Q(0,4)$ - Complexity $O(6L_r)$, 

b) First recursion $r_I(1,4)$, $r_Q(1,4)$, $\sigma(1,4)$, $y_I(1,4)$, $y_Q(1,4)$ - Complexity $O(4L_r + 9)$, 

c) Second recursion $r_I(2,4)$, $r_Q(2,4)$, $\sigma(2,4)$, $y_I(2,4)$, $y_Q(2,4)$ - Complexity $O(4L_r + 15)$, 

d) Third recursion $r_I(3,4)$, $r_Q(3,4)$, $\sigma(3,4)$, $y_I(3,4)$, $y_Q(3,4)$ - Complexity $O(4L_r + 15)$. 
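
The per-step costs listed in the two pseudo codes can be tallied mechanically. The short sketch below sums them layer by layer and returns the coefficient of $L_r$ and the constant term separately. The function name `preprocessing_cost` is hypothetical, and applying the same per-recursion costs ($4L_r + 9$ for the first recursion of a layer, $4L_r + 15$ for each later one) to values of $L_t$ other than 2 and 4 is an extrapolation assumed for the example.

```python
def preprocessing_cost(L_t):
    """Return (a, b) such that the tallied preprocessing cost is a * L_r + b."""
    coef_Lr = 0
    const = 0
    for layer in range(1, L_t + 1):
        # Initialization: sigma(0,j), y_I(0,j), y_Q(0,j), each of cost 2*L_r.
        coef_Lr += 6
        for recursion in range(1, layer):
            # r_I and r_Q: two scalar products of cost 2*L_r each.
            coef_Lr += 4
            # Scalar updates of sigma, y_I, y_Q: 9 for the first recursion
            # of a layer, 15 for each subsequent one (as in the lists above).
            const += 9 if recursion == 1 else 15
    return coef_Lr, const


if __name__ == "__main__":
    for L_t in (2, 3, 4):
        a, b = preprocessing_cost(L_t)
        print(f"L_t = {L_t}: {a} L_r + {b}")
```

Summing the initialization and scalar-product terms in closed form gives an $L_r$ coefficient of $\sum_{j=1}^{L_t} \left(6 + 4(j-1)\right) = 2L_t^2 + 4L_t$, which the tally reproduces.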
The overall complexity for the $L_t = 4$ case is $O(48L_r + 72)$. In general the complexity of the 
preprocessing algorithm is $O(2L_rL_t^2 + 4L_tL_r)$, the same order of magnitude reported in Section 

Fig. 4. Performance comparison of detection algorithms. $L_t = 2$, $L_r = 2$ antennas, BICM MIMO-OFDM, convolutional code, 
channels B and D. 

Fig. 5. Performance comparison of detection algorithms. 64QAM CR 5/6, $L_t = 2$, $L_r = 3$, BICM MIMO-OFDM, convolutional 
code, channels B and D. 

Fig. 6. Performance comparison of detection algorithms. $L_t = 3$, $L_r = 3$, BICM MIMO-OFDM, convolutional code, channel 
model D. 

Fig. 7. Performance comparison. $L_t = 4$, $L_r = 4$, 64QAM CR 5/6, BICM MIMO-OFDM, CC and LDPCC, channels B and 
D. 




Fig. 10. Performance comparison of MIMO-OFDM GC, Sphere Decoded, and CC BICM, 8 bpcu, channel model D. 
(Plot not reproduced; horizontal axis: SNR, with ticks from 22 to 36.) 
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